Bug 1860774: Allow fallback to serving cert renewal accounting for egress IPs on SDN #137

Merged
merged 1 commit into from Oct 25, 2021

Conversation

JoelSpeed
Contributor

When using the OpenShift Egress IP feature (only available on select platforms, e.g. vSphere) with SDN, additional IPs are assigned to the VM's network interfaces. The kubelet then picks these up and adds them to the CSR SANs.

In IPI clusters, these IPs are also picked up by Machine API and the IP addresses are listed in the Machine status. During certificate renewals, these IPs are matched and the CSR is approved.

When using IPI clusters, we expect the SANs to match. Because Egress IPs come and go, this isn't reliably the case. To allow this use case, this PR adjusts the CMA (cluster-machine-approver) so that, if the CSR's SANs don't match exactly because the IP address lists differ, the check allows the CSR to contain any previously allowed IPs plus any IPs listed in the Node's HostSubnet EgressIPs.

This should allow the egress IPs to be moved around by SDN as appropriate and for the CSR certificate renewals to function as expected.

This is not an ideal solution, as it doesn't ensure that Egress IPs are removed from the CSR once they are removed from the Node. However, we need a solution that will work until CCMs are GA. At that point we believe we can revert this and teach the CCM about egress IP ranges, so that it can exclude those IPs when the CSRs are created.
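
The relaxed check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `mustParseIPs`, `allowedIPs` and `checkCSRIPs` are hypothetical names, the egress IPs are plain strings standing in for the HostSubnet EgressIPs field, and the error text reuses the message from the test table in this PR with the offending IP appended.

```go
package main

import (
	"fmt"
	"net"
)

// mustParseIPs converts strings to net.IP values and panics on invalid input.
// It is a convenience helper for this sketch only.
func mustParseIPs(addrs ...string) []net.IP {
	ips := make([]net.IP, 0, len(addrs))
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil {
			panic(fmt.Sprintf("invalid IP %q", a))
		}
		ips = append(ips, ip)
	}
	return ips
}

// allowedIPs returns the IPs from the current serving certificate plus the
// node's egress IPs (hypothetical helper; egress IPs are plain strings here).
func allowedIPs(certIPs []net.IP, egressIPs []string) ([]net.IP, error) {
	allowed := append([]net.IP{}, certIPs...)
	for _, s := range egressIPs {
		ip := net.ParseIP(s)
		if ip == nil {
			return nil, fmt.Errorf("invalid egress IP %q", s)
		}
		allowed = append(allowed, ip)
	}
	return allowed, nil
}

// checkCSRIPs returns an error if the CSR requests an IP outside the allowed set.
func checkCSRIPs(csrIPs, allowed []net.IP) error {
	for _, want := range csrIPs {
		known := false
		for _, ip := range allowed {
			if ip.Equal(want) {
				known = true
				break
			}
		}
		if !known {
			return fmt.Errorf("CSR Subject Alternate Names includes unknown IP addresses: %s", want)
		}
	}
	return nil
}
```

Note that the CSR may contain fewer IPs than the allowed set, which is why a removed Egress IP is not forced out of later CSRs.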

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 18, 2021
@openshift-ci
Contributor

openshift-ci bot commented Oct 18, 2021

@JoelSpeed: This pull request references Bugzilla bug 1860774, which is invalid:

  • expected the bug to target the "4.10.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1860774: Allow fallback to serving cert renewal accounting for egress IPs on SDN

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JoelSpeed
Contributor Author

/bugzilla refresh

@openshift-ci
Contributor

openshift-ci bot commented Oct 18, 2021

@JoelSpeed: An error was encountered querying GitHub for users with public email (zhsun@redhat.com) for bug 1860774 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.


@JoelSpeed
Contributor Author

/bugzilla refresh

I've managed to manually test this on a vSphere UPI cluster and can confirm that this is working.

Notes for QE:

I1018 15:28:29.290968       1 controller.go:118] Reconciling CSR: csr-wf47q
I1018 15:28:29.338023       1 csr_check.go:153] csr-wf47q: CSR does not appear to be client csr
I1018 15:28:29.341768       1 csr_check.go:507] retrieving serving cert from compute-0 (172.31.248.90:10250)
I1018 15:28:29.353138       1 csr_check.go:184] Found existing serving cert for compute-0
I1018 15:28:29.353341       1 csr_check.go:194] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
I1018 15:28:29.353383       1 csr_check.go:195] Current SAN Values: [compute-0 172.31.248.90], CSR SAN Values: [compute-0 172.31.248.200 172.31.248.90]
I1018 15:29:56.154987       1 csr_check.go:215] Falling back to serving cert renewal with Egress IP checks
I1018 15:29:56.266509       1 controller.go:200] CSR csr-wf47q approved
  • The "Falling back to serving cert renewal with Egress IP checks" line is the important part here: the old check still runs first and fails, then the new check runs afterwards.
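
The ordering visible in the log can be sketched as below. This is a simplified illustration under stated assumptions: SANs are treated as one list of strings (the real check handles names and IPs as certificate fields), the exact-match comparison is assumed to ignore order, and `authorizeRenewal`, `sameSet` and `subsetOf` are hypothetical names.

```go
package main

import "sort"

// sameSet reports whether a and b contain the same strings, ignoring order.
func sameSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string{}, a...)
	y := append([]string{}, b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// subsetOf reports whether every element of a appears in allowed.
func subsetOf(a, allowed []string) bool {
	set := make(map[string]bool, len(allowed))
	for _, s := range allowed {
		set[s] = true
	}
	for _, s := range a {
		if !set[s] {
			return false
		}
	}
	return true
}

// authorizeRenewal runs the strict check first and only then the egress
// fallback. It returns "exact", "egress-fallback", or "" when denied.
func authorizeRenewal(csrSANs, certSANs, egressIPs []string) string {
	if sameSet(csrSANs, certSANs) {
		return "exact"
	}
	allowed := append(append([]string{}, certSANs...), egressIPs...)
	if subsetOf(csrSANs, allowed) {
		return "egress-fallback"
	}
	return ""
}
```

With the values from the log (certificate SANs `compute-0 172.31.248.90`, CSR SANs additionally containing `172.31.248.200`), the strict comparison fails and approval depends on `172.31.248.200` being one of the node's egress IPs.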

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 18, 2021
@openshift-ci
Contributor

openshift-ci bot commented Oct 18, 2021

@JoelSpeed: This pull request references Bugzilla bug 1860774, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2


@@ -121,6 +121,7 @@ rules:
- config.openshift.io
resources:
- clusteroperators
- networks
verbs:
- get
- create
Contributor


Can we define networks permissions separately in order to not provide create permission?

Contributor


+1

}

allowedIPAddresses := currentCert.IPAddresses
for _, ipAddr := range hostSubnet.EgressIPs {
Contributor

@lobziik lobziik Oct 18, 2021


Contributor


+1

Contributor

@elmiko elmiko left a comment


mostly makes sense to me, i agree with some of Denis' comments, also given your comment in the description about removing this when CCMs are GA i think we should probably have a TODO comment on the functions that should be reviewed after CCM GA (i think authorizeServingRenewalWithEgressIPs, not sure if more are needed)


},
wantErr: "CSR Subject Alternate Names includes unknown IP addresses",
},
{
Contributor


Can we add one more test case with CIDRs, just for visibility purposes?

Contributor


+1
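
For illustration only, a table-driven case involving CIDRs might look like the sketch below. How CIDR entries are actually treated by this PR is not shown in the thread, so the rule used here (a CSR IP is accepted when it equals an egress IP or falls inside an egress CIDR) is an assumption, and `checkIPAgainstEgress`, `cidrCase` and `failedCases` are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
)

// checkIPAgainstEgress returns an error unless csrIP equals one of egressIPs
// or falls inside one of egressCIDRs. Treating CIDRs this way is an
// assumption made for this sketch.
func checkIPAgainstEgress(csrIP string, egressIPs, egressCIDRs []string) error {
	ip := net.ParseIP(csrIP)
	if ip == nil {
		return fmt.Errorf("invalid CSR IP %q", csrIP)
	}
	for _, s := range egressIPs {
		if e := net.ParseIP(s); e != nil && e.Equal(ip) {
			return nil
		}
	}
	for _, s := range egressCIDRs {
		_, block, err := net.ParseCIDR(s)
		if err != nil {
			return fmt.Errorf("invalid egress CIDR %q: %w", s, err)
		}
		if block.Contains(ip) {
			return nil
		}
	}
	return fmt.Errorf("CSR Subject Alternate Names includes unknown IP addresses")
}

// cidrCase is one row of a table-driven test.
type cidrCase struct {
	name        string
	csrIP       string
	egressIPs   []string
	egressCIDRs []string
	wantErr     bool
}

// failedCases runs the table and returns the names of rows whose outcome
// differs from wantErr.
func failedCases(cases []cidrCase) []string {
	var failed []string
	for _, tc := range cases {
		err := checkIPAgainstEgress(tc.csrIP, tc.egressIPs, tc.egressCIDRs)
		if (err != nil) != tc.wantErr {
			failed = append(failed, tc.name)
		}
	}
	return failed
}
```

In a real test file the loop would normally live in a `testing` function with `t.Run` per row; it is written as a plain function here so the sketch stays self-contained.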

@lobziik
Contributor

lobziik commented Oct 18, 2021

Only one nitpick about adding a test case with CIDRs; aside from that, looks good to me.

@JoelSpeed
Contributor Author

Pushed a fix for the CIDR test case

@lobziik
Contributor

lobziik commented Oct 18, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2021
}

network := &configv1.Network{}
if err := c.Get(context.Background(), client.ObjectKey{Name: "cluster"}, network); err != nil {
Member

@enxebre enxebre Oct 18, 2021


Can we move "cluster" and "OpenShiftSDN" into constants, then move the get of configv1.Network{} and the network.Status.NetworkType == "OpenShiftSDN" logic into a unit-tested function isEgress(node)? That function could also take https://docs.openshift.com/container-platform/4.9/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html#nw-egress-ips-node_configuring-egress-ips-ovn, or even the EgressIP.Status object or the HostSubnet, into account as discrimination criteria (not sure which of the three is the most appropriate right now).
That way the last fallback is gated by if servingCert != nil && isEgress(node).

Can this also happen for e.g OVN?

Contributor Author


Will move it out into a function, but I don't think we need to account for EgressIP objects: they are OVN specific, and the network folks we spoke to don't think this is an issue on OVN Kubernetes. OVN doesn't assign actual IPs to the interfaces but instead uses IPTables magic to implement this function.

With OVN the discrimination is Egress IP objects, with SDN it's hostsubnets
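
A gate along the lines discussed here could look like the sketch below. The function name `needsEgressCheck` is the one mentioned later in this PR, but its signature here is an assumption: `networkGetter` is a hypothetical stand-in for the client call that fetches the Network object and returns its Status.NetworkType, and `fakeNetwork` / `failingNetwork` exist only to make the sketch testable.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// networkConfigName is the name of the singleton Network object.
	networkConfigName = "cluster"
	// networkTypeSDN is the network type for which the egress fallback applies.
	networkTypeSDN = "OpenShiftSDN"
)

// networkGetter is a hypothetical stand-in for a client call that returns the
// network type recorded on the named Network object.
type networkGetter func(name string) (networkType string, err error)

// needsEgressCheck reports whether the egress IP fallback should be
// attempted, which in this sketch is only when the cluster runs OpenShift SDN.
func needsEgressCheck(get networkGetter) (bool, error) {
	networkType, err := get(networkConfigName)
	if err != nil {
		return false, fmt.Errorf("could not get network %q: %w", networkConfigName, err)
	}
	return networkType == networkTypeSDN, nil
}

// fakeNetwork returns a getter that always reports the given network type.
func fakeNetwork(networkType string) networkGetter {
	return func(string) (string, error) { return networkType, nil }
}

// failingNetwork returns a getter that always fails.
func failingNetwork() networkGetter {
	return func(string) (string, error) { return "", errors.New("lookup failed") }
}
```

Keeping the lookup behind a small function type makes the gate easy to unit test without a cluster, which is the point of the request above.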

- networks
verbs:
- get
- list
Member


Why do we need any verbs other than get?

Contributor Author


We are using controller-runtime, which automatically caches resources when you fetch them by setting up an informer. Informers need list and watch as well.

The additional permissions don't give access to any more information in this case, as the network is a singleton and the hostsubnets are all required because they map 1:1 with Nodes.

Member


Shouldn't this be using a regular client rather than a cache?

Member

@enxebre enxebre Oct 19, 2021


MachineClient: uncachedManagementClient,
MachineRestCfg: managementConfig,
MachineNamespace: machineNamespace,
NodeClient: uncachedWorkloadClient,

e.g. adding the relevant resources so that they are not cached here https://github.com/openshift/cluster-machine-approver/blob/master/main.go#L141-L146 and renaming that client to target/guestClusterClient

Contributor Author


We could add this as uncached and avoid the informer, but I don't know that we particularly need to.

In particular, we expect the network object never to change during the lifetime of the cluster, and the hostsubnets should be pretty static as well.

Member


My thinking is that I see no need for an informer here, since nothing needs to be watched or reacted to, and a cache seems unnecessary because this is requested very occasionally, so there's no impact on the API server. Also, although it is unlikely, if the controller happens to use stale cached hostsubnet IP values, it won't approve and won't retry.
Either way is fine with me.

Contributor Author


if it happens that the controller uses the hostsubnet IP cached values it won't approve nor retry.

This is a valid concern; I'll make the change. I'll have to get another UPI cluster up to test with, and will report back later this afternoon on whether it's still working.
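
The concern can be illustrated with a toy model, which does not use controller-runtime at all: `hostSubnets` is a hypothetical map standing in for HostSubnet objects, and `snapshot` mimics a cache that was populated before an egress IP was assigned. A decision made against the older view rejects a CSR that a direct read would approve.

```go
package main

// hostSubnets maps a node name to its egress IPs (a hypothetical stand-in
// for HostSubnet objects).
type hostSubnets map[string][]string

// snapshot copies the map, mimicking a cache populated at an earlier time.
func (h hostSubnets) snapshot() hostSubnets {
	out := make(hostSubnets, len(h))
	for node, ips := range h {
		out[node] = append([]string{}, ips...)
	}
	return out
}

// approves reports whether every CSR IP is either in the current certificate
// or among the node's egress IPs as seen through src.
func approves(src hostSubnets, node string, certIPs, csrIPs []string) bool {
	allowed := map[string]bool{}
	for _, ip := range certIPs {
		allowed[ip] = true
	}
	for _, ip := range src[node] {
		allowed[ip] = true
	}
	for _, ip := range csrIPs {
		if !allowed[ip] {
			return false
		}
	}
	return true
}
```

For example, if the snapshot is taken before `172.31.248.200` is assigned to `compute-0`, `approves` returns false for the snapshot and true for the live map, even though both describe the same node.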

resources:
- hostsubnets
verbs:
- get
Member


Why do we need any verbs other than get?

Contributor Author


See above comment about controller runtime

Contributor

@elmiko elmiko left a comment


thanks for updating Joel, i'm good with this barring Alberto's comments.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2021
@JoelSpeed
Contributor Author

/hold

I want to squash this once all of the feedback is in

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2021
@enxebre
Member

enxebre commented Oct 21, 2021

Once CCMs are in place, the issue should no longer exist, in which case I suggest we revert this PR for releases from that point on. I'm expecting we will only maintain this for 2/3 releases.

ok cool, let's make sure we add a TODO here and capture in Jira. Thanks, this looks good on my side.

@JoelSpeed
Contributor Author

Already got a TODO here
https://github.com/openshift/cluster-machine-approver/pull/137/files#diff-76acaac62a240f198549c59b688a4134a72a0cb75033116849fcbba92ce9ea38R312-R313

Renamed the Egress IP function to needsEgressCheck as that's more appropriate

And have added a CIDR test as per @elmiko's comment

And a JIRA to make sure we revert this https://issues.redhat.com/browse/OCPCLOUD-1310

Contributor

@elmiko elmiko left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2021
@lobziik
Contributor

lobziik commented Oct 25, 2021

/lgtm

@elmiko
Contributor

elmiko commented Oct 25, 2021

/approve

@openshift-ci
Contributor

openshift-ci bot commented Oct 25, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2021
@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.


@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 25, 2021

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-disruptive 0128ba6 link false /test e2e-aws-disruptive
ci/prow/e2e-azure-operator 0128ba6 link false /test e2e-azure-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.


@openshift-merge-robot openshift-merge-robot merged commit 0a64d91 into openshift:master Oct 25, 2021
@openshift-ci
Contributor

openshift-ci bot commented Oct 25, 2021

@JoelSpeed: All pull requests linked via external trackers have merged:

Bugzilla bug 1860774 has been moved to the MODIFIED state.


@JoelSpeed
Contributor Author

/cherry-pick release-4.9

@JoelSpeed JoelSpeed deleted the egress-ip-checks branch November 16, 2021 14:02
@openshift-cherrypick-robot

@JoelSpeed: #137 failed to apply on top of branch "release-4.9":

Applying: Allow fallback to serving cert renewal accounting for egress IPs on SDN
Using index info to reconstruct a base tree...
M	main.go
M	pkg/controller/csr_check.go
M	pkg/controller/csr_check_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/csr_check_test.go
Auto-merging pkg/controller/csr_check.go
CONFLICT (content): Merge conflict in pkg/controller/csr_check.go
Auto-merging main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Allow fallback to serving cert renewal accounting for egress IPs on SDN
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

7 participants