New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WINC-687: Windows instances react to kubelet CA rotation #1013
Conversation
Skipping CI for Draft Pull Request. |
/test ? |
@jrvaldes: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test azure-e2e-operator |
37759d6
to
c90e94c
Compare
controllers/controllers.go
Outdated
// the kube-apiserver client. No drain or restart required. | ||
func (r *instanceReconciler) reconcileKubeletClientCA(ctx context.Context, initialCAConfigMap, | ||
bundleCAConfigMap *core.ConfigMap) error { | ||
r.log.Info("reconciling kubelet CA client certificates with", "ConfigMap", bundleCAConfigMap.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WMCO log after triggering the kubelet CA rotation with two (2) Windows nodes.
1.648497188796151e+09 INFO controllers.configmap reconciling kubelet CA client certificates with {"ConfigMap": "kube-apiserver-to-kubelet-client-ca"}
1.6484971887963562e+09 DEBUG controllers.configmap creating auxiliary kubelet CA certificate {"file": "/tmp/kubelet-ca.crt"}
1.6484971887966876e+09 DEBUG controllers.configmap processing {"node count": 2}
1.6484971887967196e+09 DEBUG wc 172.31.251.210 initializing SSH connection
1.6484971890362916e+09 INFO controllers.configmap updating kubelet CA client certificates in {"node": "winworker-f2n8m"}
1.6484971894075413e+09 DEBUG wc 172.31.251.210 run {"cmd": "powershell.exe -NonInteractive -ExecutionPolicy Bypass \"Test-Path C:\\k\\\\kubelet-ca.crt\"", "out": "True\r\n"}
1.6484971897684174e+09 DEBUG wc 172.31.251.210 run {"cmd": "powershell.exe -NonInteractive -ExecutionPolicy Bypass \"$out = Get-FileHash C:\\k\\\\kubelet-ca.crt -Algorithm SHA256; $out.Hash\"", "out": "6C4A61CB99D4C329E67E303DE77EDFF54E9193A2B108A13AC4FAA1BE2F0199AF\r\n"}
1.648497189768449e+09 DEBUG wc 172.31.251.210 copy {"local file": "/tmp/kubelet-ca.crt", "remote dir": "C:\\k\\"}
1.6484971898165677e+09 DEBUG wc 172.31.251.88 initializing SSH connection
1.6484971900478635e+09 INFO controllers.configmap updating kubelet CA client certificates in {"node": "winworker-9j74z"}
1.6484971904268918e+09 DEBUG wc 172.31.251.88 run {"cmd": "powershell.exe -NonInteractive -ExecutionPolicy Bypass \"Test-Path C:\\k\\\\kubelet-ca.crt\"", "out": "True\r\n"}
1.6484971908091497e+09 DEBUG wc 172.31.251.88 run {"cmd": "powershell.exe -NonInteractive -ExecutionPolicy Bypass \"$out = Get-FileHash C:\\k\\\\kubelet-ca.crt -Algorithm SHA256; $out.Hash\"", "out": "6C4A61CB99D4C329E67E303DE77EDFF54E9193A2B108A13AC4FAA1BE2F0199AF\r\n"}
1.6484971908091886e+09 DEBUG wc 172.31.251.88 copy {"local file": "/tmp/kubelet-ca.crt", "remote dir": "C:\\k\\"}
1.648497190859712e+09 DEBUG controllers.configmap deleting kubelet CA certificate {"file": "/tmp/kubelet-ca.crt"}
/hold Depends on submodule update with openshift/windows-machine-config-bootstrapper#344 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with the initial CA configmap -- my main concern (might be a stupid one) can be summarized as follows: if the initial kubelet CA is only valid for one year, then do we have a code path here that breaks after a year?
controllers/configmap_controller.go
Outdated
@@ -21,6 +21,7 @@ import ( | |||
"net" | |||
|
|||
config "github.com/openshift/api/config/v1" | |||
"github.com/openshift/windows-machine-config-operator/pkg/certificates" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to next import grouping
controllers/configmap_controller.go
Outdated
@@ -21,6 +21,7 @@ import ( | |||
"net" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: IMO the commit message should be trimmed. The middle two paragraphs describing the different scenarios that could trigger CA bundle updates can be dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
pkg/windows/windows.go
Outdated
@@ -489,6 +494,25 @@ func (vm *windows) ConfigureAzureCloudNodeManager(nodeName string) error { | |||
return nil | |||
} | |||
|
|||
// UpdateKubeletClientCA updates the kubelet client CA certificate file in the Windows node. No service restart or | |||
// reboot required, kubelet detects the changes in the file system and use the new CA certificate. The file is replaced | |||
// only and only if it does not exist or there is a checksum mismatch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only and only if
--> if and only if
pkg/certificates/bundle.go
Outdated
} | ||
} | ||
|
||
func GetInitialCAConfigMap(ctx context.Context, client client.Client) (*core.ConfigMap, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc for this method
controllers/controllers.go
Outdated
return err | ||
} | ||
// extract CAs from new bundle CA ConfigMap | ||
kubeAPIServerServingCABytes, err := certificates.GetCAsFromConfigMap(bundleCAConfigMap, "ca-bundle.crt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making "ca-bundle.crt"
a constant with a doc explaining where this value comes from
controllers/controllers.go
Outdated
// TODO: Improve windows.sshConnectivity.transfer() to avoid below IO operations. | ||
// We already have the content of the the bundle CA in memory, but we are forced to dump it into a file because | ||
// is the only mechanism accepted by windows.sshConnectivity.transfer(). The later IO operations are expensive and | ||
// could be avoided by overloading .transfer() to accept a byte array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would appreciate the creation of a tech debt story around this. And linking the story in this TODO comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to happen once a year or once in 6 months. I don't think people are going to keep rotating it. I do understand IO operation is expensive but we don't need to worry about it in this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But interesting to know data is read from file and is the only way available now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that is my take too. I don't feel we should track it in the backlog (for now).
Correct, it is a time bomb[1] |
controllers/controllers.go
Outdated
return err | ||
} | ||
// extract CAs from new bundle CA ConfigMap | ||
kubeAPIServerServingCABytes, err := certificates.GetCAsFromConfigMap(bundleCAConfigMap, certificates.CABundleKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you name the previous one initialKubeAPI.., you can name this new or current-kubeAPIser...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No prefix means current.
controllers/controllers.go
Outdated
|
||
// TODO: Improve windows.sshConnectivity.transfer() to avoid below IO operations. | ||
// We already have the content of the the bundle CA in memory, but we are forced to dump it into a file because | ||
// is the only mechanism accepted by windows.sshConnectivity.transfer(). The later IO operations are expensive and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it is
controllers/controllers.go
Outdated
// TODO: Improve windows.sshConnectivity.transfer() to avoid below IO operations. | ||
// We already have the content of the the bundle CA in memory, but we are forced to dump it into a file because | ||
// is the only mechanism accepted by windows.sshConnectivity.transfer(). The later IO operations are expensive and | ||
// could be avoided by overloading .transfer() to accept a byte array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to happen once a year or once in 6 months. I don't think people are going to keep rotating it. I do understand IO operation is expensive but we don't need to worry about it in this situation.
controllers/controllers.go
Outdated
// TODO: Improve windows.sshConnectivity.transfer() to avoid below IO operations. | ||
// We already have the content of the the bundle CA in memory, but we are forced to dump it into a file because | ||
// is the only mechanism accepted by windows.sshConnectivity.transfer(). The later IO operations are expensive and | ||
// could be avoided by overloading .transfer() to accept a byte array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But interesting to know data is read from file and is the only way available now
} | ||
r.log.V(1).Info("processing", "node count", len(winNodes.Items)) | ||
// loop Windows nodes and trigger kubelet CA update | ||
for _, winNode := range winNodes.Items { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume if we have 3 Windows nodes and the second one fails due to some error condition, aren't we going to try the 3rd node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. In case the 3rd node fails, the reconciler will reconcile the failed event where nodes 1 and 2 are up to date, and only the 3rd node will get the file update.
pkg/certificates/bundle.go
Outdated
const ( | ||
// CABundleKey is the key in the Kube API Server CA ConfigMap where the CA bundle is located | ||
CABundleKey = "ca-bundle.crt" | ||
// KubeApiServerOperatorNamespace is the name of the namespace of the ConfigMap that contains the CA for the kubelet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namespace of the ConfigMap - ConfigMap's namespace
pkg/certificates/bundle.go
Outdated
// recognize the kube-apiserver client certificate. | ||
KubeAPIServerServingCAConfigMapName = "kube-apiserver-to-kubelet-client-ca" | ||
|
||
// configNamespace is the name of the namespace of the ConfigMap that contains the initial kubelet client CA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namespace of the ConfigMap - ConfigMap's namespace
pkg/certificates/bundle.go
Outdated
c, err := x509.ParseCertificate(b.Bytes) | ||
if err != nil { | ||
// glog.Warningf("Could not parse initial bundle certificate: %v", err) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we keep hitting in this part, can this become an infinite loop?. The for loop is running as long as len(initialBundle) > 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, sounds like a good case for unit testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see unit tests around this functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 6498dc0
test/e2e/validation_test.go
Outdated
@@ -464,6 +464,11 @@ func testExpectedServicesRunning(t *testing.T) { | |||
} | |||
} | |||
|
|||
// testKubeletCARotation test the kubelet CA rotation | |||
func testKubeletCARotation(t *testing.T) { | |||
t.Skip("Not implemented yet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there going to be a different ticket to track this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Included in e163351
89b06cb
to
e163351
Compare
controllers/controllers.go
Outdated
// get default username | ||
username := r.getDefaultUsername() | ||
// check node for username annotation (BYOH case) | ||
usernameAnnotation := node.Annotations[UsernameAnnotation] | ||
if usernameAnnotation == "" { | ||
return nil, errors.New("node is missing valid username annotation") | ||
if usernameAnnotation != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instanceReconciler struct is a way for us to handle Nodes and Windows Instances in a generic way. By introducing more specific casing like this, we are moving away from that design. I'd prefer that the username annotation is applied to all Windows nodes, including Machine based Nodes.
That would involve passing in the annotation in this call, like is done with the ConfigMap reconciler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I thought about that too. It's more nosy (and complex) so I went this route instead.
Happy to chat more about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning towards including the Windows annotation on all nodes too, but I understand that may be out of scope of this story. Another alternative seems to be splitting this into 2 functions upon the respective Reconciler structs, one for machine nodes and one for BYOH?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how complex is adding Windows annotation on all nodes? I understand it is out of the scope of this PR. But the change that we are making has a direct impact on the current design. So curious to know the complexity level for this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that involved from what I can tell:
- call
crypto.EncryptToJSONString
to get encrypted value of Machine username - pass in encrypted username annotation to
r.ensureInstanceIsUpToDate
call in wm_controller - remove this if/else block in secret_controller, leaving only the code from within the
if
- for e2e tests, add UsernameAnnotation to
annotations
and remove theif BYOH
check here - remove the
isBYOH
condition from this if statement in create_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@saifshaikh48 thanks for highlighting the implementation details.
Dropped 804520f in favor of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this if/else block in secret_controller, leaving only the code from within the if
It makes sense, but that is a change of behavior. I don't think we should introduce it as part of this effort. So, far Windows instances configured by MachineController reconcile the annotation by "deletion".
85f4ea9
to
9818df0
Compare
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jrvaldes, sebsoto The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/retest-required Please review the full test history for this PR and help us cut down flakes. |
11 similar comments
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
@jrvaldes: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/cherry-pick release-4.10 |
/cherry-pick community-4.10 |
@jrvaldes: #1013 failed to apply on top of branch "release-4.10":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@jrvaldes: #1013 failed to apply on top of branch "community-4.10":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
serving CA ConfigMap from the Kube API Server
kubelet CA rotation
location and name of the file is fixed to
C:\k\kubelet-ca.crt
by WMCB whilesetting the kubelet configuration.
The initial Kubelet CA certificate is valid for 1 year, from the date of the
cluster installation. Usually, the first rotation of the kubelet CA certificate
is generated by the API Server Operator
at 80% for the CA certificate lifecycle (1 year), which is 292 days, and removes
the previous CA certificate after 365 days.
Alternatively, a cluster administrator can trigger the kubelet CA rotation
on-demand at any time with the following command:
The proposed implementation covers both of the above scenarios and does not require
node drain or node reboot. When the new CA bundle (kubelet-ca.crt) is updated
in the worker node, kubelet detects the change in the file system, reads the
certificates from the file, and updates the client configuration.
Related MCO PRs: