New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MON-977 pkg/manifests: rotate grpc key material gracefully #816
MON-977 pkg/manifests: rotate grpc key material gracefully #816
Conversation
/retest |
the failure
seems to be unrelated, as it is not using grpc key material, but let's see if this failure is consistent. |
pkg/manifests/tls.go
Outdated
} | ||
|
||
crt, crtPresent := s.Data["ca.crt"] | ||
key, keyPresent := s.Data["ca.key"] | ||
if _, ok := s.Annotations["monitoring.openshift.io/grpc-forced-rotate"]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _, ok := s.Annotations["monitoring.openshift.io/grpc-forced-rotate"]; ok { | |
if _, ok := s.Annotations["monitoring.openshift.io/grpc-cert-forced-rotate"]; ok { |
Let's be a little bit more explicit what it is about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not just the cert, it is the complete key material, including server/client certs and keys, hence i suggest grpc-tls-forced-rotate
to gather the whole context.
pkg/manifests/tls.go
Outdated
klog.Warningf("creating a new CA due to error reading CA: %v", err) | ||
} else if needsNewCert(curCA.Config.Certs[0].NotBefore, curCA.Config.Certs[0].NotAfter, time.Now) { | ||
rotate = true | ||
fmt.Println("##### certs expired") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoopsie
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happens to me as well often, its WIP. And no curse words included so all good ;)
4b5aa4e
to
797d921
Compare
@@ -59,6 +59,9 @@ func TestUserWorkloadMonitoring(t *testing.T) { | |||
{"assert user workload rules", assertUserWorkloadRules}, | |||
{"assert tenancy model is enforced", assertTenancyForMetrics}, | |||
{"assert prometheus and alertmanager is not deployed in user namespace", assertPrometheusAlertmanagerInUserNamespace}, | |||
{"assert grpc tls rotation", assertGRPCTLSRotation}, | |||
{"assert user workload metrics", assertUserWorkloadMetrics}, | |||
{"assert user workload rules", assertUserWorkloadRules}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @openshift/openshift-team-monitoring i am lacking a bit of creativity here tbh 🤔 do you have other ideas how we could test this, other than a) enforcing rotation and b) asserting if everything "still" works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be adding this to this existing test? I would prefer a separate test case here. TestGRPCTLSUserWorkloadMonitoring maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, that makes things a bit more complex though, as tests run in parallel and I'd like to execute rotation in the context of user workload monitoring too.
I'd need to refactor then the e2e test suite here so we test two (potentially concurrently running) user workload monitoring tests one dedicated for rotation, the other for asserting metrics/alerts. Do you think the added complexity is worth it? If yes, I'd go ahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just feel like this test is growing, that is my main concern and we are testing too much in one test which would make it not clear what is failing. Its not a blocker for me, happy to have a follow up PR in 4.7 or after feature freeze to refactor this as well, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree, I feel on the edge here, the list is still well readable but I agree we should start thinking about a potential refactor. I submitted https://issues.redhat.com/browse/MON-1170 to track this and would suggest to tackle it as a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That sounds good!
/test e2e-aws-operator |
1 similar comment
/test e2e-aws-operator |
Another failure in |
/test e2e-aws-operator |
build is green, but I would like to execute a couple more runs. |
/retest |
/test e2e-aws-operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job! 🎉
One question -> do we trigger the controller resync on this file?
Couple of edge cases questions, otherwise lgtm
pkg/manifests/tls.go
Outdated
if err != nil { | ||
return nil, err | ||
klog.Warningf("creating a new CA due to error reading CA: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning msg is a bit hard to understand what actually we are warning the user? That we failed to get CA due to an error, there was no creation yet, correct?
We just continue here not return? Its because we create a new one later on, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a quirky edge case. It is hard as we are indeed flying blind what went wrong here. For some reason reading the existing CA failed (maybe someone modified it manually, or any other corruption happened)?
We have two options:
a) bail out here and crash-loop and wait for the user to resolve the corrupted CA (not good)
b) what I implemented here: log out the issue and simply re-issue a new CA.
wdyt?
if crtPresent && keyPresent { | ||
ca, err := crypto.GetCAFromBytes(crt, key) | ||
if !rotate { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could there be a case where rotate is false and we could not get current CA or its nil? In that case we do nothing, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rotate is initialized already with rotate = !crtPresent || !keyPresent
, but you're right, I missed setting it to true if reading failed 👍
Let me add a unit test for this case too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, i do have a unit test for this edge case, let me recheck 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha, great catch, the unit test is wrong and hence also the logic, fixing 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/manifests/tls.go
Outdated
} | ||
|
||
// CreateCertificate creates a new certificate and returns it in x509.Certificate form. | ||
func CreateCertificate(template, parent *x509.Certificate, pub, priv interface{}) (*x509.Certificate, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this function should be exposed, seems like we only use it in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh no reason, let me set it to private.
@openshift/openshift-team-monitoring ptal |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly nitpicks really apart from one question for unit tests, feel free to address the nits if you want, nothing too urgent at all!
/lgtm
/hold
feel free to remove hold if you want to skip addressing the nits! 🎉 🚢
} | ||
|
||
crt, crtPresent := s.Data["ca.crt"] | ||
key, keyPresent := s.Data["ca.key"] | ||
if _, ok := s.Annotations["monitoring.openshift.io/grpc-tls-forced-rotate"]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can we define these Annotations as part of the constant, to avoid typos. Happy to have it on function level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally not a fan of this, as those literals are intrinsic constants already (we don't follow that idiom elsewhere here and in many golang projects too), especially given the fact that this is not a library used externally.
pkg/manifests/tls.go
Outdated
sets.NewString("prometheus-grpc"), | ||
crypto.DefaultCertificateLifetimeInDays, | ||
) | ||
if err != nil { | ||
return nil, errors.Wrap(err, "error making prometheus-k8s server certificate") | ||
return errors.Wrap(err, "error making prometheus-k8s server certificate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: What does prometheus-k8s mean to users? 🤔 e.g. the one we deploy in openshift-cluster-monitoring, correct? To users might be helpful to include that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a good point, we actually have only two certificate, one server cert (used in thanos ruler and prometheus) and one client cert. I will use these terms instead.
{ | ||
name: "force rotation", | ||
setup: func(s *v1.Secret) { | ||
s.Annotations["monitoring.openshift.io/grpc-tls-forced-rotate"] = "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add an Annotation withblah/foo-bar
? Would it make sense to have that testcase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected behavior would be a no-op, not sure about the benefit, but I can add a test case
loaded, err := t.client.GetSecret(s.Namespace, s.Name) | ||
switch { | ||
case apierrors.IsNotFound(err): | ||
// No secret was found, proceed with the default empty secret from manifests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add debug logs here besides comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a great idea 👍
case apierrors.IsNotFound(err): | ||
// No secret was found, proceed with the default empty secret from manifests. | ||
case err == nil: | ||
// Secret was found, use that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as above, up to you if you think its needed as you were developing this.
// This method creates a central secret containing GRPC TLS key material | ||
// if the given secret is nil | ||
// or updates it if the CA present in the given secret is about to expire. | ||
func (f *Factory) GRPCSecret() (*v1.Secret, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can move this to the manifests.go now, that is purely just manifests generation? Not urgent to address just as we have this pattern already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i was back and forth on this one. It could be considered the first step to untangle manifests.go
which is too big already. I would leave it here and consequently cut manifests.go
in separate files as a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the test for it is in pkg/manifests/manifests_test.go
, I would then either move this function there or move the test to tls_test.go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's move the test to tls_test.go
indeed and set a first cutting precedence 👍
@lilic PTAL and unhold if you're fine with the proposed changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nit about GRPCSecret being tested in manifests_test.go but not moved to manifests.go :) otherwise:
/lgtm
/hold
feel free to remove hold if you wanted it this way. 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
🎉
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lilic, s-urbaniak The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes. |
This is the initial implementation for a graceful rotation based on the simplified scheme. Please take a look. It still misses a small e2e test which enforces rotation.
/cc @openshift/openshift-team-monitoring