Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MON-977 pkg/manifests: rotate grpc key material gracefully #816

Merged
merged 1 commit into from Jun 29, 2020

Conversation

s-urbaniak
Copy link
Contributor

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

This is the initial implementation for a graceful rotation based on the simplified scheme. Please take a look. It still misses a small e2e test which enforces rotation.

/cc @openshift/openshift-team-monitoring

@openshift-ci-robot openshift-ci-robot requested a review from a team June 18, 2020 10:19
@s-urbaniak s-urbaniak changed the title pkg/manifests: rotate grpc key material gracefully WIPpkg/manifests: rotate grpc key material gracefully Jun 18, 2020
@s-urbaniak s-urbaniak changed the title WIPpkg/manifests: rotate grpc key material gracefully WIP pkg/manifests: rotate grpc key material gracefully Jun 18, 2020
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 18, 2020
@s-urbaniak
Copy link
Contributor Author

/retest

@s-urbaniak
Copy link
Contributor Author

the failure

    --- FAIL: TestUserWorkloadMonitoring/assert_user_workload_metrics (10.16s)
        client.go:270: unexpected status code response, want 200, got 403 ("\n<!DOCTYPE html>\n<html lang=\"en\" charset=\"utf-8\">\n<head>\n  <title>Log In</title>\n  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no\">\n  <style>\n    @font-face {\n      font-family: \"Open Sans\";\n      src: url(data:application/x-font-woff;charset=utf-8;base64,d09GRgABAAAAAFeoABMAAAAAlkQAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABGRlRNAAABqAAAABwAAAAcavCZq0dERUYAAAHEAAAAHQAAAB4AJwD1R1BPUwAAAeQAAASjAAAJni1yF0JHU1VCAAAGiAAAAIEAAACooGKInk9TLzIAAAcMAAAAYAAAAGCh3ZrDY21hcAAAB2wAAAGcAAACAv1rbL5jdnQgAAAJCAAAADIAAAA8K3MG4GZwZ20AAAk8AAAE+gAACZGLC3pBZ2FzcAAADjgAAAAIAAAACAAAABBnbHlmAAAOQAAAQG0AAHBIDuDVH2hlYWQAAE6wAAAANAAAADYHgk2EaGhlYQAATuQAAAAgAAAAJA37BfVobXR4AABPBAAAAjgAAAO8MaBM1GxvY2EAAFE8AAAB1QAAAeB9N5qybWF4cAAAUxQAAAAgAAAAIAMhAjxuYW1lAABTNAAAAeQAAARWRvKTBXBvc3QAAFUYAAAB+AAAAvgEbWOAcHJlcAAAVxAAAACQAAAAkPNEIux3ZWJmAABXoAAAAAYAAAAGxDVUvgAAAAEAAAAA0MoNVwAAAADJQhegAAAAANDkdLN42mNgZGBg4AFiMSBmYmAEwndAzALmMQAADdgBHQAAAHjarZZNTJRHGMf/uyzuFm2RtmnTj2hjKKE0tikxAb...")
    user_workload_monitoring_test.go:65: scenario "assert user workload metrics" failed

seems to be unrelated, as it is not using grpc key material, but let's see if this failure is consistent.

}

crt, crtPresent := s.Data["ca.crt"]
key, keyPresent := s.Data["ca.key"]
if _, ok := s.Annotations["monitoring.openshift.io/grpc-forced-rotate"]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if _, ok := s.Annotations["monitoring.openshift.io/grpc-forced-rotate"]; ok {
if _, ok := s.Annotations["monitoring.openshift.io/grpc-cert-forced-rotate"]; ok {

Let's be a little bit more explicit what it is about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not just the cert, it is the complete key material, including server/client certs and keys, hence i suggest grpc-tls-forced-rotate to gather the whole context.

klog.Warningf("creating a new CA due to error reading CA: %v", err)
} else if needsNewCert(curCA.Config.Certs[0].NotBefore, curCA.Config.Certs[0].NotAfter, time.Now) {
rotate = true
fmt.Println("##### certs expired")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoopsie

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happens to me as well often, its WIP. And no curse words included so all good ;)

@s-urbaniak s-urbaniak force-pushed the tls-rotation branch 2 times, most recently from 4b5aa4e to 797d921 Compare June 22, 2020 10:58
@s-urbaniak s-urbaniak changed the title WIP pkg/manifests: rotate grpc key material gracefully pkg/manifests: rotate grpc key material gracefully Jun 22, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 22, 2020
@@ -59,6 +59,9 @@ func TestUserWorkloadMonitoring(t *testing.T) {
{"assert user workload rules", assertUserWorkloadRules},
{"assert tenancy model is enforced", assertTenancyForMetrics},
{"assert prometheus and alertmanager is not deployed in user namespace", assertPrometheusAlertmanagerInUserNamespace},
{"assert grpc tls rotation", assertGRPCTLSRotation},
{"assert user workload metrics", assertUserWorkloadMetrics},
{"assert user workload rules", assertUserWorkloadRules},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @openshift/openshift-team-monitoring i am lacking a bit of creativity here tbh 🤔 do you have other ideas how we could test this, other than a) enforcing rotation and b) asserting if everything "still" works?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be adding this to this existing test? I would prefer a separate test case here. TestGRPCTLSUserWorkloadMonitoring maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, that makes things a bit more complex though, as tests run in parallel and I'd like to execute rotation in the context of user workload monitoring too.

I'd need to refactor then the e2e test suite here so we test two (potentially concurrently running) user workload monitoring tests one dedicated for rotation, the other for asserting metrics/alerts. Do you think the added complexity is worth it? If yes, I'd go ahead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just feel like this test is growing, that is my main concern and we are testing too much in one test which would make it not clear what is failing. Its not a blocker for me, happy to have a follow up PR in 4.7 or after feature freeze to refactor this as well, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree, I feel on the edge here, the list is still well readable but I agree we should start thinking about a potential refactor. I submitted https://issues.redhat.com/browse/MON-1170 to track this and would suggest to tackle it as a follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That sounds good!

@s-urbaniak
Copy link
Contributor Author

/test e2e-aws-operator

1 similar comment
@s-urbaniak
Copy link
Contributor Author

/test e2e-aws-operator

@s-urbaniak
Copy link
Contributor Author

Another failure in TestMultinamespacePrometheusRule, investigating.

@s-urbaniak
Copy link
Contributor Author

/test e2e-aws-operator

@s-urbaniak
Copy link
Contributor Author

build is green, but I would like to execute a couple more runs.

@s-urbaniak
Copy link
Contributor Author

/retest

@s-urbaniak
Copy link
Contributor Author

/test e2e-aws-operator

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! 🎉
One question -> do we trigger the controller resync on this file?

Couple of edge cases questions, otherwise lgtm

if err != nil {
return nil, err
klog.Warningf("creating a new CA due to error reading CA: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning msg is a bit hard to understand what actually we are warning the user? That we failed to get CA due to an error, there was no creation yet, correct?

We just continue here not return? Its because we create a new one later on, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a quirky edge case. It is hard as we are indeed flying blind what went wrong here. For some reason reading the existing CA failed (maybe someone modified it manually, or any other corruption happened)?

We have two options:
a) bail out here and crash-loop and wait for the user to resolve the corrupted CA (not good)
b) what I implemented here: log out the issue and simply re-issue a new CA.

wdyt?

if crtPresent && keyPresent {
ca, err := crypto.GetCAFromBytes(crt, key)
if !rotate {
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a case where rotate is false and we could not get current CA or its nil? In that case we do nothing, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rotate is initialized already with rotate = !crtPresent || !keyPresent, but you're right, I missed setting it to true if reading failed 👍

Let me add a unit test for this case too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i do have a unit test for this edge case, let me recheck 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, great catch, the unit test is wrong and hence also the logic, fixing 🎉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

// CreateCertificate creates a new certificate and returns it in x509.Certificate form.
func CreateCertificate(template, parent *x509.Certificate, pub, priv interface{}) (*x509.Certificate, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this function should be exposed, seems like we only use it in this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh no reason, let me set it to private.

@s-urbaniak
Copy link
Contributor Author

@openshift/openshift-team-monitoring ptal

@s-urbaniak
Copy link
Contributor Author

One question -> do we trigger the controller resync on this file?
yes as per

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly nitpicks really apart from one question for unit tests, feel free to address the nits if you want, nothing too urgent at all!

/lgtm

/hold
feel free to remove hold if you want to skip addressing the nits! 🎉 🚢

}

crt, crtPresent := s.Data["ca.crt"]
key, keyPresent := s.Data["ca.key"]
if _, ok := s.Annotations["monitoring.openshift.io/grpc-tls-forced-rotate"]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can we define these Annotations as part of the constant, to avoid typos. Happy to have it on function level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally not a fan of this, as those literals are intrinsic constants already (we don't follow that idiom elsewhere here and in many golang projects too), especially given the fact that this is not a library used externally.

sets.NewString("prometheus-grpc"),
crypto.DefaultCertificateLifetimeInDays,
)
if err != nil {
return nil, errors.Wrap(err, "error making prometheus-k8s server certificate")
return errors.Wrap(err, "error making prometheus-k8s server certificate")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: What does prometheus-k8s mean to users? 🤔 e.g. the one we deploy in openshift-cluster-monitoring, correct? To users might be helpful to include that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a good point, we actually have only two certificate, one server cert (used in thanos ruler and prometheus) and one client cert. I will use these terms instead.

{
name: "force rotation",
setup: func(s *v1.Secret) {
s.Annotations["monitoring.openshift.io/grpc-tls-forced-rotate"] = "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an Annotation withblah/foo-bar? Would it make sense to have that testcase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected behavior would be a no-op, not sure about the benefit, but I can add a test case

loaded, err := t.client.GetSecret(s.Namespace, s.Name)
switch {
case apierrors.IsNotFound(err):
// No secret was found, proceed with the default empty secret from manifests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add debug logs here besides comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a great idea 👍

case apierrors.IsNotFound(err):
// No secret was found, proceed with the default empty secret from manifests.
case err == nil:
// Secret was found, use that.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above, up to you if you think its needed as you were developing this.

// This method creates a central secret containing GRPC TLS key material
// if the given secret is nil
// or updates it if the CA present in the given secret is about to expire.
func (f *Factory) GRPCSecret() (*v1.Secret, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can move this to the manifests.go now, that is purely just manifests generation? Not urgent to address just as we have this pattern already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i was back and forth on this one. It could be considered the first step to untangle manifests.go which is too big already. I would leave it here and consequently cut manifests.go in separate files as a follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the test for it is in pkg/manifests/manifests_test.go , I would then either move this function there or move the test to tls_test.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move the test to tls_test.go indeed and set a first cutting precedence 👍

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2020
@s-urbaniak
Copy link
Contributor Author

@lilic PTAL and unhold if you're fine with the proposed changes.

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit about GRPCSecret being tested in manifests_test.go but not moved to manifests.go :) otherwise:

/lgtm
/hold

feel free to remove hold if you wanted it this way. 🎉

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2020
Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

🎉

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lilic, s-urbaniak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lilic
Copy link
Contributor

lilic commented Jun 29, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 29, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 3d25c0d into openshift:master Jun 29, 2020
@s-urbaniak s-urbaniak changed the title pkg/manifests: rotate grpc key material gracefully MON-977 pkg/manifests: rotate grpc key material gracefully Jun 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants