ETCD-535: Manual CA rotation should rotate all leaf certs #1200

Merged
merged 1 commit into from Mar 14, 2024

Conversation

tjungblu
Contributor

@tjungblu tjungblu commented Feb 8, 2024

/hold

The cool thing is that we're now able to "swap" signers with the existing logic with:

$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -

which effectively overwrites the old signer in openshift-config with the new signer from openshift-etcd. That works because the bundle containing the new signer has already been distributed to all control-plane (CP) nodes. The cluster-etcd-operator (CEO) will then proceed to rewrite all leaf certs, which are then rolled out together via etcd-all-certs.

Manual rotation is then just a two-step process:

Generate new signer:

$ oc delete secret etcd-signer -n openshift-etcd

... wait for the rollout ...

Replace the old signer with the new signer:

$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
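
For readers less familiar with jq, the filter in the command above only drops the metadata fields that bind the Secret to its source namespace and stored revision, so that the same object can be applied into openshift-config. A minimal Go sketch of that one step follows. The helper name `stripNamespacedMetadata` and the sample JSON are hypothetical and not part of this PR; the sketch only mirrors what the jq expression does.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stripNamespacedMetadata is a hypothetical helper mirroring the jq filter:
// it deletes the metadata fields that tie a Secret to its source namespace
// and to a specific stored revision, and leaves everything else untouched.
func stripNamespacedMetadata(secretJSON []byte) ([]byte, error) {
	var obj map[string]any
	if err := json.Unmarshal(secretJSON, &obj); err != nil {
		return nil, err
	}
	meta, ok := obj["metadata"].(map[string]any)
	if !ok {
		return nil, fmt.Errorf("secret has no metadata object")
	}
	for _, field := range []string{"namespace", "creationTimestamp", "resourceVersion", "selfLink", "uid"} {
		delete(meta, field)
	}
	return json.Marshal(obj)
}

func main() {
	// Made-up sample input; a real secret carries more fields.
	in := []byte(`{"kind":"Secret","metadata":{"name":"etcd-signer","namespace":"openshift-etcd","uid":"abc","resourceVersion":"42"},"data":{"tls.crt":"...","tls.key":"..."}}`)
	out, err := stripNamespacedMetadata(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The name and the certificate data are kept, which is why applying the result in openshift-config replaces the old signer there.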

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 8, 2024
@openshift-ci-robot

openshift-ci-robot commented Feb 8, 2024

@tjungblu: This pull request references ETCD-535 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

/hold

This is expected to increase the CPU usage, given the node update frequency and the amount of cert parsing/validation introduced.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2024
@tjungblu
Contributor Author

tjungblu commented Feb 9, 2024

/test e2e-operator

@tjungblu tjungblu marked this pull request as draft February 9, 2024 09:49
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2024
@tjungblu
Contributor Author

tjungblu commented Feb 9, 2024

/test e2e-operator

@tjungblu
Contributor Author

tjungblu commented Feb 9, 2024

/test e2e-operator

@tjungblu
Contributor Author

tjungblu commented Feb 9, 2024

/test e2e-operator

@tjungblu
Contributor Author

tjungblu commented Feb 9, 2024

/test unit

@tjungblu
Contributor Author

tjungblu commented Feb 9, 2024

/test e2e-operator
/test unit

@tjungblu
Contributor Author

/test e2e-operator

@tjungblu
Contributor Author

/test e2e-operator
/test unit

@tjungblu
Contributor Author

/test e2e-operator
/test unit

@tjungblu
Contributor Author

/test e2e-operator
/test unit

@tjungblu
Contributor Author

/test e2e-operator

@openshift-ci-robot

openshift-ci-robot commented Feb 13, 2024

@tjungblu: This pull request references ETCD-535 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

/hold

This is expected to increase the CPU usage, given the node update frequency and the amount of cert parsing/validation introduced.


Some results so far:

  • the nodes are updated on their heartbeat interval, so the controller triggers every couple of seconds -> reducing the informer to only the CP nodes
  • with only the CP nodes watched, we add about 5% CPU usage to the operator; the biggest chunk is still TLS handshakes with etcd (about 30-40%, which is still too high for my taste, given that we cache the clients)
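
The idea of narrowing the informer to control-plane nodes can be illustrated with a small, self-contained Go sketch. It is not the code from this PR: the `node` type and the `controlPlaneOnly` helper are hypothetical stand-ins, and the use of the `node-role.kubernetes.io/master` label to identify CP nodes is an assumption about how the selection could be done.

```go
package main

import "fmt"

// masterRoleLabel is assumed here to mark control-plane nodes.
const masterRoleLabel = "node-role.kubernetes.io/master"

// node is a hypothetical, stripped-down stand-in for a Kubernetes Node object.
type node struct {
	Name   string
	Labels map[string]string
}

// isControlPlane reports whether the node carries the control-plane role label.
// Only the presence of the label matters, not its value.
func isControlPlane(n node) bool {
	_, ok := n.Labels[masterRoleLabel]
	return ok
}

// controlPlaneOnly keeps the nodes whose updates should trigger the controller.
func controlPlaneOnly(nodes []node) []node {
	var out []node
	for _, n := range nodes {
		if isControlPlane(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{Name: "master-0", Labels: map[string]string{masterRoleLabel: ""}},
		{Name: "worker-0", Labels: map[string]string{"node-role.kubernetes.io/worker": ""}},
	}
	for _, n := range controlPlaneOnly(nodes) {
		fmt.Println(n.Name)
	}
}
```

Filtering this way means heartbeat updates from worker nodes no longer wake the controller, which is the effect the first bullet describes.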

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2024
@tjungblu tjungblu force-pushed the manual_rota branch 2 times, most recently from db8112d to ba6481c Compare February 27, 2024 09:43
@openshift-ci-robot

openshift-ci-robot commented Feb 27, 2024

@tjungblu: This pull request references ETCD-535 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

/hold

This is expected to increase the CPU usage, given the node update frequency and the amount of cert parsing/validation introduced.


Some results so far:

  • measure....

@openshift-ci-robot

openshift-ci-robot commented Feb 27, 2024

@tjungblu: This pull request references ETCD-535 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

/hold

@tjungblu
Contributor Author

/test e2e-operator
/test unit

@tjungblu tjungblu marked this pull request as ready for review March 12, 2024 15:27
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 12, 2024
@openshift-ci openshift-ci bot requested a review from Elbehery March 12, 2024 15:29
Comment on lines 16 to 20
type CARotatingTargetCertCreator struct {
	certrotation.TargetCertCreator
}

func (c *CARotatingTargetCertCreator) NeedNewTargetCertKeyPair(
Contributor

So I get that we're overriding the NeedNewTargetCertKeyPair() method here to add the extra check that triggers rotation when the signer doesn't match.

Shouldn't we be using CARotatingTargetCertCreator somewhere then?
e.g:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/tlshelpers/tlshelpers.go#L216
or
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/tlshelpers/tlshelpers.go#L109

(Unless I'm being dense and missing where we have that already).
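
To make the point concrete, here is a rough, self-contained Go sketch of the pattern under discussion: a type that embeds a base creator and overrides its "need a new cert?" method to add a signer-mismatch check. Everything in it is a hypothetical simplification. `baseCreator`, `signerAwareCreator`, `NeedNewCert` and `newCert` are illustrative names, and the real `NeedNewTargetCertKeyPair` in library-go has a different signature. The override only takes effect where the wrapper type is actually used, which is the concern raised above.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// baseCreator is a hypothetical stand-in for the embedded creator.
type baseCreator struct{}

// NeedNewCert returns a non-empty reason when the leaf has expired.
func (baseCreator) NeedNewCert(leaf, signer *x509.Certificate) string {
	if time.Now().After(leaf.NotAfter) {
		return "leaf certificate expired"
	}
	return ""
}

// signerAwareCreator embeds baseCreator and adds the signer check.
type signerAwareCreator struct {
	baseCreator
}

// NeedNewCert first defers to the embedded check, then also requests
// rotation when the leaf was not signed by the current signer.
func (c signerAwareCreator) NeedNewCert(leaf, signer *x509.Certificate) string {
	if reason := c.baseCreator.NeedNewCert(leaf, signer); reason != "" {
		return reason
	}
	if err := leaf.CheckSignatureFrom(signer); err != nil {
		return "leaf certificate was not issued by the current signer"
	}
	return ""
}

// newCert creates a certificate for cn; with a nil parent it is self-signed.
func newCert(cn string, isCA bool, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  isCA,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		tmpl.KeyUsage |= x509.KeyUsageCertSign
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

func main() {
	oldCA, oldKey, err := newCert("old-signer", true, nil, nil)
	if err != nil {
		panic(err)
	}
	leaf, _, err := newCert("etcd-peer", false, oldCA, oldKey)
	if err != nil {
		panic(err)
	}
	newCA, _, err := newCert("new-signer", true, nil, nil)
	if err != nil {
		panic(err)
	}
	c := signerAwareCreator{}
	fmt.Printf("against old signer: %q\n", c.NeedNewCert(leaf, oldCA))
	fmt.Printf("against new signer: %q\n", c.NeedNewCert(leaf, newCA))
}
```

A caller that still constructs the plain base type would never run the extra check, so the wrapper has to be wired in wherever the leaf certs are created.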

Contributor Author

LOL you're right. Sorry, I'll get that done throughout the day.

Contributor Author

done

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@tjungblu
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 13, 2024
@hasbro17
Contributor

/lgtm
/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 14, 2024
Contributor

openshift-ci bot commented Mar 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 479c2c7 and 2 for PR HEAD f7ab538 in total

Contributor

openshift-ci bot commented Mar 14, 2024

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                           Commit   Details  Required  Rerun command
ci/prow/e2e-gcp-qe-no-capabilities  f7ab538  link     false     /test e2e-gcp-qe-no-capabilities
ci/prow/e2e-aws-etcd-recovery       f7ab538  link     false     /test e2e-aws-etcd-recovery

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a2747fc and 1 for PR HEAD f7ab538 in total

@tjungblu
Contributor Author

/override ci/prow/e2e-operator-fips

unrelated OLM failure

Contributor

openshift-ci bot commented Mar 14, 2024

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-operator-fips

In response to this:

/override ci/prow/e2e-operator-fips

unrelated OLM failure

@tjungblu
Contributor Author

I'm going to set this label to not let a full week of CI run data go to waste - sorry it took so long to retest and eventually get here

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Mar 14, 2024
@tjungblu
Contributor Author

sigh, another retest for today 🥱

@openshift-merge-bot openshift-merge-bot bot merged commit eeef803 into openshift:master Mar 14, 2024
12 of 14 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-etcd-operator-container-v4.16.0-202403180813.p0.geeef803.assembly.stream.el9 for distgit cluster-etcd-operator.
All builds following this will include this PR.

Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy.
approved Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
lgtm Indicates that a PR is ready to be merged.
5 participants