Bug 1860774: Allow fallback to serving cert renewal accounting for egress IPs on SDN #137
|
@JoelSpeed: This pull request references Bugzilla bug 1860774, which is invalid.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/bugzilla refresh |
|
@JoelSpeed: An error was encountered querying GitHub for users with public email (zhsun@redhat.com) for bug 1860774 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"
Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh. |
|
/bugzilla refresh
I've managed to manually test this on a vSphere UPI cluster and can confirm that this is working. Notes for QE:
|
|
@JoelSpeed: This pull request references Bugzilla bug 1860774, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: |
manifests/01-rbac.yaml
@@ -121,6 +121,7 @@ rules:
  - config.openshift.io
  resources:
  - clusteroperators
  - networks
  verbs:
  - get
  - create
Can we define the networks permissions in a separate rule so that we don't grant create permission on them?
+1
}

allowedIPAddresses := currentCert.IPAddresses
for _, ipAddr := range hostSubnet.EgressIPs {
I believe that egressCIDRs should also be counted and checked there.
https://docs.openshift.com/container-platform/4.6/networking/openshift_sdn/assigning-egress-ips.html#nw-egress-ips-automatic_egress-ips
+1
Mostly makes sense to me. I agree with some of Denis' comments. Also, given your comment in the description about removing this when CCMs are GA, I think we should probably have a TODO comment on the functions that should be reviewed after CCM GA (I think authorizeServingRenewalWithEgressIPs; not sure if more are needed).
},
wantErr: "CSR Subject Alternate Names includes unknown IP addresses",
},
{
Can we add one more test case with CIDRs, just for visibility?
+1
|
Only one nitpick about adding a test case with a CIDR; aside from that, looks good to me. |
|
Pushed a fix for the CIDR test case |
|
/lgtm |
pkg/controller/csr_check.go
}

network := &configv1.Network{}
if err := c.Get(context.Background(), client.ObjectKey{Name: "cluster"}, network); err != nil {
Can we move "cluster" and "OpenShiftSDN" into constants, then move the configv1.Network{} lookup and the network.Status.NetworkType == "OpenShiftSDN" check into a unit-tested function isEgress(node) that also takes into account https://docs.openshift.com/container-platform/4.9/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html#nw-egress-ips-node_configuring-egress-ips-ovn, or even the EgressIP.Status object or HostSubnet (not sure which of the three is the most appropriate right now), as part of the discrimination criteria?
That way the last fallback is gated by if servingCert != nil && isEgress(node).
Can this also happen for, e.g., OVN?
Will move it out into a function, but I don't think we need to account for EgressIP objects, as they are OVN-specific, and we spoke to the network folks, who don't think this is an issue on OVN Kubernetes. OVN doesn't assign actual IPs to the interfaces but instead uses iptables magic to implement this function.
With OVN the discrimination is EgressIP objects; with SDN it's HostSubnets.
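As a rough illustration of the gating discussed in this thread, the sketch below pulls the two string literals into constants and isolates the decision in a small function that can be unit tested. It is only a sketch under assumptions: networkStatus is a hypothetical stand-in for the status of the configv1.Network object, egressFallbackApplies is a made-up name, and "OVNKubernetes" is assumed as the network type value for OVN.

```go
package main

import "fmt"

const (
	// clusterNetworkName is the name of the singleton network config object.
	clusterNetworkName = "cluster"
	// openShiftSDN is the network type for which the fallback is enabled.
	openShiftSDN = "OpenShiftSDN"
)

// networkStatus is a hypothetical stand-in for the status of the
// configv1.Network object; only the field used by the check is modelled.
type networkStatus struct {
	NetworkType string
}

// egressFallbackApplies reports whether the egress IP fallback should be
// tried: only when a current serving certificate exists and the cluster
// runs OpenShift SDN.
func egressFallbackApplies(hasServingCert bool, status networkStatus) bool {
	return hasServingCert && status.NetworkType == openShiftSDN
}

func main() {
	for _, networkType := range []string{openShiftSDN, "OVNKubernetes"} {
		status := networkStatus{NetworkType: networkType}
		fmt.Printf("network %q (%s): fallback applies = %v\n",
			clusterNetworkName, networkType, egressFallbackApplies(true, status))
	}
}
```

Keeping the decision in a pure function like this means the network lookup itself can be tested separately from the gating logic.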
manifests/01-rbac.yaml
  - networks
  verbs:
  - get
  - list
Why do we need any verbs other than get?
We are using controller-runtime, which automatically caches resources when you fetch them by setting up an informer; informers need list and watch as well.
The additional permissions don't give you access to any more information in this case, as the network is a singleton and the hostsubnets are all required because they map 1:1 with Nodes.
Shouldn't this be using a regular client rather than a cache?
cluster-machine-approver/main.go
Lines 155 to 158 in 7b7289a
    MachineClient:    uncachedManagementClient,
    MachineRestCfg:   managementConfig,
    MachineNamespace: machineNamespace,
    NodeClient:       uncachedWorkloadClient,
e.g. adding the relevant resources so that they are not cached here https://github.com/openshift/cluster-machine-approver/blob/master/main.go#L141-L146 and renaming that client to target/guestClusterClient
We could add this as uncached and then avoid the informer, but I don't know if we particularly need to.
In particular, we expect the network object to never change during the lifetime of the cluster and the hostsubnets should be pretty static as well
My thinking is that I see no need for an informer here, since nothing needs to be watched or reacted to, and a cache seems unnecessary since this is requested very occasionally, so there's no impact on the API server. Also, although it is unlikely, if the controller happens to use stale cached hostsubnet IP values, it will neither approve nor retry.
Either way is fine with me.
"if it happens that the controller uses the hostsubnet IP cached values it won't approve nor retry."
This is a valid concern; I'll make the change. I'll have to get another UPI cluster up to test with, and will report back later this afternoon on whether it's still working.
  resources:
  - hostsubnets
  verbs:
  - get
Why do we need any verbs other than get?
See above comment about controller runtime
Thanks for updating, Joel. I'm good with this barring Alberto's comments.
|
/hold
I want to squash this once all of the feedback is in |
ok cool, let's make sure we add a TODO here and capture in Jira. Thanks, this looks good on my side. |
655a406 to 0128ba6
|
Already got a TODO here
Renamed the Egress IP function to
And have added a CIDR test as per @elmiko's comment
And a JIRA to make sure we revert this https://issues.redhat.com/browse/OCPCLOUD-1310 |
/lgtm
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: elmiko The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
@JoelSpeed: The following tests failed, say
Full PR test history. Your PR dashboard. |
|
@JoelSpeed: All pull requests linked via external trackers have merged: Bugzilla bug 1860774 has been moved to the MODIFIED state. |
|
/cherry-pick release-4.9 |
|
@JoelSpeed: #137 failed to apply on top of branch "release-4.9": |
When using the OpenShift Egress IP feature (only available on select platforms, e.g. vSphere) with SDN, additional IPs are assigned to the VM's network interfaces. These are then picked up by the kubelet and added to the CSR SANs.
In IPI clusters, these IPs are also picked up by Machine API, and the IP addresses are listed in the Machine status. During certificate renewals, these IPs are matched and the CSR is approved.
When using UPI clusters, we expect the SANs to match those of the existing serving certificate. Because Egress IPs come and go, this isn't reliably the case. To allow this use case, this PR adjusts the CMA so that if the CSR doesn't match exactly because its IP address list differs, the check allows the CSR to contain any previously allowed IPs plus any IPs listed in the Node's HostSubnet EgressIPs.
This should allow SDN to move the egress IPs around as appropriate while CSR certificate renewals continue to function as expected.
This is not an ideal solution, as it doesn't ensure that Egress IPs that have been removed are also removed from the CSR. However, we need a solution that will work until CCMs are GA, at which point we believe we can revert this and teach the CCM about egress IP ranges so that it can exclude those IPs when the CSRs are created.