New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable bound SA tokens #718
Enable bound SA tokens #718
Conversation
e2071a3
to
adec557
Compare
/retest |
adec557
to
dc66056
Compare
serviceAccountIssuer: auth.openshift.io | ||
apiAudiences: | ||
- auth.openshift.io | ||
serviceAccountSigningKeyFile: /etc/kubernetes/static-pod-certs/secrets/service-account-signing-key/service-account.key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't there are need to override this for bootstrapping? Or do we get this key from the render step that early?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent is not to enable bound tokens in the bootstrapping phase, since as per a discussion on the enhancement there does not appear to be a need for that. Will it be necessary to set these values in code that can detect that bootstrapping is complete rather than here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're ok not having bound tokens available from the bootstrap kubeapiserver. we asked in the enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question was more whether this path is a valid one during bootstrap or whether the process dies if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. Will test tomorrow. Maybe the operator will have to detect that it is past the bootstrap phase and compose the config accordingly in code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two override yaml files, one for the bootstrap phase, one for after. Just set sensible values for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second) | ||
} | ||
|
||
certConfigMap, err := c.configMapClient.ConfigMaps(targetNamespace).Get(CertConfigMapName, metav1.GetOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the secret is changed, but the config map is not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If its the operator secret that is changing, that is not a problem because it is not used directly by the apiserver instances. In the case of the operand secret, notice that promotion occurs only after the configmap has been successfully updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully the functional separation makes my intent clear?
// Giving time for apiserver instances to pick up the change in public keys before | ||
// changing the private key minimizes the potential for one or more apiservers to | ||
// issue tokens signed by the new private key that apiservers without the | ||
// corresponding public key are unable to validate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we lack back-pressure again here, right? If rolling update is blocked for some reason, we might get into trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still a problem now that actual state is considered rather than just waiting for a random amount of time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will read the new code. Looking at the deployed state should be enough.
// corresponding public key are unable to validate. | ||
// | ||
// TODO(marun) Find a more accurate indication that all apiservers are capable of | ||
// validating tokens signed by the new private key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the etcd encryption code, we wait until all API servers have settled on the same revision, and that there is no new pending revision. Maybe that approach works here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for having the CM revisioned.
Are the service account tokens private and pub keys read dynamically since we're just swapping them here without redeployment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point me to the etcd code in question? And should there be a corresponding update to the kcmo token controller?
re: dynamic key reads - afaict it's enough to just update the resource. Any changes to the resources/config that influence the state of apiserver pods prompt a redeployment of a single pod and only if that redeployment is successful will the change be rolled out to all pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this code, you're even ok to avoid demanding a stable level, you just need each revision on nodes to include the cert for your key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL.
@@ -106,6 +106,11 @@ servingInfo: | |||
requestTimeoutSeconds: 3600 | |||
serviceAccountPublicKeyFiles: | |||
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs | |||
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs | |||
serviceAccountIssuer: auth.openshift.io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be mentioned as a default value in openshift/api#569
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it wasn't clear to me that a value that wasn't set by default at the API level should be documented for the API type. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use" | ||
readyInterval = 5 * time.Minute | ||
|
||
CertConfigMapName = "bound-sa-token-signing-certs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there will never be actual certs in this CM, will they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but I'm being consistent with the name of the controller-manager equivalent (sa-tokens-signing-certs
) which has effectively the same content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, ok
} | ||
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0 | ||
if needKeypair { | ||
newSecret, err := newSigningSecret() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While ApplySecret
has certain output, a human-friendly log line might be helpful here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// corresponding public key are unable to validate. | ||
// | ||
// TODO(marun) Find a more accurate indication that all apiservers are capable of | ||
// validating tokens signed by the new private key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for having the CM revisioned.
Are the service account tokens private and pub keys read dynamically since we're just swapping them here without redeployment?
go wait.Until(func() { | ||
ticker := time.NewTicker(time.Minute) | ||
defer ticker.Stop() | ||
|
||
for { | ||
c.queue.Add(workQueueKey) | ||
select { | ||
case <-ticker.C: | ||
case <-stopCh: | ||
return | ||
} | ||
} | ||
|
||
}, time.Minute, stopCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It this correct? That inner ticker
looks quite unnecessary given that wait.Until
has its own internal timer which does the same, this basically runs wait.Until
and freezes it on its first loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right - it should be sufficient to call func() {c.queue.Add(workQueueKey)}
as the argument to wait.Util
. Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
dc66056
to
a4e341b
Compare
@@ -106,6 +106,11 @@ servingInfo: | |||
requestTimeoutSeconds: 3600 | |||
serviceAccountPublicKeyFiles: | |||
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs | |||
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs | |||
serviceAccountIssuer: auth.openshift.io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please pass as a flag, not as a struct value. Same with all these values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're looking for apiServerArguments
above and add comments to help future me who won't remember this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0 | ||
if needKeypair { | ||
klog.Infof("Creating a new signing secret for bound service account tokens.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's important enough for an info message, it's important enough for an event. If it isn't important enough for an event, then it belongs at a lower level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which would you prefer - even or lower level log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// did not have the latest public key would not be able to | ||
// validate those new tokens. | ||
TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use" | ||
readyInterval = 5 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have the ability to inspect the revisions to figure out which ones contains the key in question. You can then inspect the levels on the nodes (the kubeapiservers.operator.openshift.io) to know if the nodes actually have the levels required. No need for time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// passed (see comment above the ready interval constant). Do not return | ||
// immediately to ensure that the new public key can be set in the configmap | ||
// in advance of promotion. | ||
c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you make your decision based on levels of the configmap actually on the nodes and trigger based on updates to kubeapiserver.operator.openshift.io, you don't need to have this delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return ret | ||
} | ||
|
||
func (c *BoundSATokenSignerController) sync() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please break this function into logical bits with no data based between them. I see
- create a secret if needed. depends on nothing outside the secrets
- update configmap if needed. This can retrieve the current secret. A stale lister will always get another event notification, so staleness of a cache doesn't matter and this can then be contained.
- promoting a key. This is based on the current secret in the operator namespace, the current secret in the operand namespace, the revisions on the nodes, the content of the configmaps on those nodes. All of which can be cached.
If any step has an error, the other steps should still be run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
2fae090
to
84a6b05
Compare
} else { | ||
// Update the operand secret only if the current public key has been synced to | ||
// all nodes. | ||
syncRequired, err = c.publicKeySyncedToAllNodes(currPublicKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/syncRequired/syncAllowed/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs | ||
# The following path contains the public keys needed to verify bound sa | ||
# tokens. This is only supported post-bootstrap. | ||
- /etc/kubernetes/static-pod-resources/configmaps/bound-sa-token-signing-certs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just copy these into the overrides file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which overrides file? afaict the only ones available are for bootstrap. Would you prefer that I add another post-bootstrap override file rather than adding this feature-specific one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file has been renamed to config-overrides.yaml as requested.
@@ -246,6 +247,7 @@ func manageKubeAPIServerConfig(client coreclientv1.ConfigMapsGetter, recorder ev | |||
"config.yaml", | |||
specialMergeRules, | |||
defaultConfig, | |||
boundSATokenConfig, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't we have the overrides yaml here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaict the overrides yaml are only for bootstrap:
https://github.com/openshift/cluster-kube-apiserver-operator/tree/master/bindata/bootkube/config
I don't know why there are 2 files for bootstrap. I do know that putting these options in the either of the bootstrap overrides will break bootstrapping because the bound token keypair is only created post-bootstrap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've posted a new PR to document the requirement to put post-bootstrap overrides in the renamed file: #731
@sttts Updated, PTAL |
67caf60
to
f31bb60
Compare
This controller is modeled after the one that manages key material for legacy sa tokens in cluster-kube-controller-manager-operator, but is simplified by not having to enable bound sa tokens during bootstrap.
f31bb60
to
fac4e70
Compare
@sttts Updated, PTAL |
/test all |
@damemi I've added the |
test/e2e/operator_test.go
Outdated
@@ -56,10 +56,10 @@ func TestRevisionLimits(t *testing.T) { | |||
|
|||
totalRevisionLimit := operatorSpec.SucceededRevisionLimit + operatorSpec.FailedRevisionLimit | |||
if operatorSpec.SucceededRevisionLimit == 0 { | |||
totalRevisionLimit += 5 | |||
totalRevisionLimit += 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a follow-up: @damemi you wrote this original test. Why was this 5? Magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now get the +=5 I believe (it's the default value?). But why does the loop below not wait for the revision pruning controller to do its work, but errors the test immediately?
3d09bad
to
95152e3
Compare
The addition of bound token configuration - specifically keypair management - appears to be adding to the number of revisions just enough to break the test. The test needs to be updated to allow the pruning controller time to work.
95152e3
to
9cfee49
Compare
@@ -77,7 +77,9 @@ func TestRevisionLimits(t *testing.T) { | |||
// are InProgress or Unknown (since these do not count toward failed or succeeded), which could indicate zombie revisions. | |||
// Check total+1 to account for possibly a current new revision that just hasn't pruned off the oldest one yet. | |||
if len(newRevisions) > int(totalRevisionLimit)+1 { | |||
t.Errorf("more revisions (%v) than total allowed (%v): %+v", len(revisions), totalRevisionLimit, revisions) | |||
// TODO(marun) If number of revisions has been exceeded, need to give time for the pruning controller to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @damemi
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: marun, sttts The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes. |
Implements operator support for openshift/enhancements#150
TODO
TokenRequest
API