STOR-1441: Restart node Pods if metrics-serving-cert changed #82

mpatlasov (Contributor)

Adding WithSecretHashAnnotationHook() for shared-resource-csi-driver-node-metrics-serving-cert ensures that a new annotation is published in the pod template of the shared-resource-csi-driver-node DaemonSet whenever that secret changes. This, in turn, restarts the node pods.
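The general mechanism the hook relies on can be sketched as below. This is a minimal, self-contained Go illustration under stated assumptions, not library-go's implementation: `hashSecretData`, `annotatePodTemplate`, and the annotation key are hypothetical names. A digest of the Secret's data is written into the pod template annotations; when the Secret changes, the digest changes, the pod template differs, and the DaemonSet controller rolls out new pods.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Hypothetical annotation key for this sketch only; the key that library-go's
// WithSecretHashAnnotationHook actually writes may differ.
const secretHashAnnotation = "example.openshift.io/metrics-serving-cert-hash"

// hashSecretData returns a deterministic SHA-256 hex digest of a Secret's data.
// Keys are sorted first because Go map iteration order is randomized.
func hashSecretData(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix key and value so different splits cannot collide.
		fmt.Fprintf(h, "%d:%s:%d:", len(k), k, len(data[k]))
		h.Write(data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// annotatePodTemplate returns a copy of the pod template annotations with the
// secret hash set. A new hash value changes the DaemonSet's pod template, and
// that change is what makes the DaemonSet controller replace its pods.
func annotatePodTemplate(annotations map[string]string, secretData map[string][]byte) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[secretHashAnnotation] = hashSecretData(secretData)
	return out
}

func main() {
	before := map[string][]byte{"tls.crt": []byte("old-cert"), "tls.key": []byte("old-key")}
	after := map[string][]byte{"tls.crt": []byte("new-cert"), "tls.key": []byte("new-key")}

	oldAnn := annotatePodTemplate(nil, before)
	newAnn := annotatePodTemplate(nil, after)
	fmt.Println("rollout needed:", oldAnn[secretHashAnnotation] != newAnn[secretHashAnnotation])
	// prints: rollout needed: true
}
```

Hashing the data rather than copying it into the annotation keeps key material out of the DaemonSet object while still producing a template change on every rotation.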

openshift-ci-robot added the jira/valid-reference label on Aug 9, 2023
openshift-ci-robot commented Aug 9, 2023

@mpatlasov: This pull request references STOR-1441 which is a valid jira issue.

mpatlasov (Contributor, Author)

/label docs-approved
/label px-approved
/approve

openshift-ci bot added the docs-approved and px-approved labels on Aug 9, 2023
adambkaplan (Contributor)

@mpatlasov out of curiosity, how often is this cert rotated in a typical OpenShift cluster? What is the impact in a large cluster (say 1,000 worker nodes)?

adambkaplan left a comment

/approve

Marking as approved to unblock lgtm.

@mpatlasov can you please amend your commit with a description justifying the change? This will help future maintainers understand why we need to restart the pods if the metrics-serving-cert is changed/rotated.

openshift-ci bot commented Aug 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, mpatlasov


openshift-ci bot added the approved label on Aug 10, 2023
Phaow commented Aug 10, 2023

Pre-merge verification passed with the pre-merge build 4.14.0-0.ci.test-2023-08-10-084739-ci-ln-t81vb0b-latest:

# After the metrics-serving-cert secret changed, the driver node pods restarted
$ oc delete secret shared-resource-csi-driver-node-metrics-serving-cert
secret "shared-resource-csi-driver-node-metrics-serving-cert" deleted

$ oc get po -l app=shared-resource-csi-driver-node -w
NAME                                    READY   STATUS        RESTARTS   AGE
shared-resource-csi-driver-node-4r2lb   2/2     Terminating   2          40m
shared-resource-csi-driver-node-76s97   2/2     Running       0          11s
shared-resource-csi-driver-node-v8xqq   2/2     Running       0          40m
shared-resource-csi-driver-node-4r2lb   0/2     Terminating   2          40m
shared-resource-csi-driver-node-fchkb   0/2     Pending       0          0s
shared-resource-csi-driver-node-fchkb   0/2     Pending       0          0s
shared-resource-csi-driver-node-fchkb   0/2     Pending       0          0s
shared-resource-csi-driver-node-fchkb   0/2     ContainerCreating   0          0s
shared-resource-csi-driver-node-fchkb   0/2     ContainerCreating   0          0s
shared-resource-csi-driver-node-4r2lb   0/2     Terminating         2          40m
shared-resource-csi-driver-node-4r2lb   0/2     Terminating         2          40m
shared-resource-csi-driver-node-4r2lb   0/2     Terminating         2          40m
shared-resource-csi-driver-node-fchkb   2/2     Running             0          1s
shared-resource-csi-driver-node-v8xqq   2/2     Terminating         0          40m
shared-resource-csi-driver-node-v8xqq   0/2     Terminating         0          41m
shared-resource-csi-driver-node-j4mlm   0/2     Pending             0          0s
shared-resource-csi-driver-node-v8xqq   0/2     Terminating         0          41m
shared-resource-csi-driver-node-j4mlm   0/2     Pending             0          0s
shared-resource-csi-driver-node-j4mlm   0/2     Pending             0          0s
shared-resource-csi-driver-node-j4mlm   0/2     ContainerCreating   0          0s
shared-resource-csi-driver-node-v8xqq   0/2     Terminating         0          41m
shared-resource-csi-driver-node-v8xqq   0/2     Terminating         0          41m
shared-resource-csi-driver-node-j4mlm   0/2     ContainerCreating   0          0s
shared-resource-csi-driver-node-j4mlm   2/2     Running             0          1s

$ oc get po -l app=shared-resource-csi-driver-node
NAME                                    READY   STATUS    RESTARTS   AGE
shared-resource-csi-driver-node-76s97   2/2     Running   0          8m4s
shared-resource-csi-driver-node-fchkb   2/2     Running   0          7m31s
shared-resource-csi-driver-node-j4mlm   2/2     Running   0          6m59s

Phaow commented Aug 10, 2023

/label qe-approved

openshift-ci bot added the qe-approved label on Aug 10, 2023
jsafrane (Contributor)

Our docs say OCP rotates certificates every 13 months, unless an admin requests rotation earlier.

The secret `shared-resource-csi-driver-node-metrics-serving-cert` is bound to the service CA cert through the `service.beta.openshift.io/serving-cert-secret-name` annotation, which makes the service CA generate and sign it. This means that if the CA cert is rotated, the secret `shared-resource-csi-driver-node-metrics-serving-cert` is automatically updated too.

This secret holds the TLS cert and key used to secure the HTTPS metrics endpoint that the OpenShift Shared Resource CSI Driver serves for Prometheus. When the cert and key are updated, the CSI driver Pods must be restarted to re-read them. Otherwise, clients that use the new cert won't be able to communicate with a server still running with the old key/cert.
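To illustrate why a restart is needed, here is a minimal Go sketch, assuming the driver parses its serving cert and key once at startup (the driver's actual startup code is not part of this PR). `selfSignedPEM`, `loadServingConfig`, and `servedCN` are hypothetical helpers, and the self-signed pair only stands in for the Secret's contents. A `tls.Config` built from the old pair keeps presenting the old certificate after the "Secret" changes; only a fresh load, which is what a pod restart amounts to, serves the new one.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// selfSignedPEM is a hypothetical stand-in for the tls.crt/tls.key pair in the
// Secret; it exists only so this sketch runs without a cluster.
func selfSignedPEM(cn string) (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// loadServingConfig parses the pair once. Assumption: this mirrors a server
// that loads its cert at startup. The config keeps the parsed certificate,
// not a link back to the Secret.
func loadServingConfig(certPEM, keyPEM []byte) (*tls.Config, error) {
	pair, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, err
	}
	return &tls.Config{Certificates: []tls.Certificate{pair}}, nil
}

// servedCN returns the CommonName of the certificate the config would present.
func servedCN(cfg *tls.Config) (string, error) {
	leaf, err := x509.ParseCertificate(cfg.Certificates[0].Certificate[0])
	if err != nil {
		return "", err
	}
	return leaf.Subject.CommonName, nil
}

func main() {
	oldCert, oldKey, err := selfSignedPEM("metrics-old")
	if err != nil {
		panic(err)
	}
	running, err := loadServingConfig(oldCert, oldKey) // pod starts
	if err != nil {
		panic(err)
	}

	// The Secret is rotated; nothing in the running process re-reads it.
	newCert, newKey, err := selfSignedPEM("metrics-new")
	if err != nil {
		panic(err)
	}
	cn, _ := servedCN(running)
	fmt.Println("running pod serves:", cn) // metrics-old

	// Only a fresh load (the equivalent of a pod restart) picks up the new pair.
	restarted, err := loadServingConfig(newCert, newKey)
	if err != nil {
		panic(err)
	}
	cn, _ = servedCN(restarted)
	fmt.Println("restarted pod serves:", cn) // metrics-new
}
```

A server could instead pick up new files without restarting, for example through `tls.Config.GetCertificate`; the PR takes the simpler route of letting the DaemonSet restart the pods.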

mpatlasov force-pushed the restart-controller-pods-if-metrics-serving-cert-changed branch from ddf4d3d to f6978ec on August 10, 2023 19:29
mpatlasov (Contributor, Author)

Hi @adambkaplan, I updated the commit description to explain why we need to restart the pods. As for your earlier questions, please see Jan's comment above: the CA cert is rotated very rarely, once every 13 months. As for the impact on a large cluster, to be honest I don't see why we should expect a severe one: end users keep using their existing PVs while the CSI driver pods restart, and a new CA cert is a big deal anyway.

openshift-ci bot commented Aug 10, 2023

@mpatlasov: all tests passed!


jsafrane (Contributor)

/lgtm

openshift-ci bot added the lgtm label on Aug 11, 2023
openshift-merge-robot merged commit 5d48701 into openshift:master on Aug 11, 2023
6 checks passed
Labels:
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
docs-approved: Signifies that Docs has signed off on this PR.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
px-approved: Signifies that Product Support has signed off on this PR.
qe-approved: Signifies that QE has signed off on this PR.

6 participants