
Dual stack vips #3269

Merged (8 commits) on Sep 16, 2022

Conversation

cybertron
Member

Previously in dual stack clusters we only provided an IPv4 VIP because it was assumed that everything in the cluster would have access to either v4 or v6. This turns out to have been a bad assumption because some users are deploying single-stack IPv6 applications in dual stack clusters and these applications then have no access to API or Ingress services.

This PR is the MCO part of the implementation of dual stack VIPs. It switches the keepalived config to consume a list of VIPs from runtimecfg. It also implements sync logic for the PlatformStatus field of the Infrastructure object to migrate values from the deprecated VIP fields to the new plural VIP fields.

Depends on openshift/baremetal-runtimecfg#176

- What I did

- How to verify it

- Description for the changelog
Add support for dual stack VIPs in on-prem platforms
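
As a rough illustration of the sync logic described in the PR description above, here is a minimal, self-contained Go sketch of migrating a deprecated singular VIP field into the new plural field. The helper name, its plain-string signature, and the addresses are hypothetical; this is not the PR's actual implementation.

```go
package main

import "fmt"

// syncVIPs seeds the new plural VIPs field from the deprecated singular
// field when the plural field is still empty; an already populated plural
// field is left untouched. Hypothetical helper for illustration only.
func syncVIPs(deprecatedVIP string, vips []string) []string {
	if len(vips) == 0 && deprecatedVIP != "" {
		return []string{deprecatedVIP}
	}
	return vips
}

func main() {
	// Cluster created before dual stack VIPs: only the singular field is set.
	fmt.Println(syncVIPs("192.168.111.5", nil))
	// Dual stack cluster: the plural field already carries both address families.
	fmt.Println(syncVIPs("192.168.111.5", []string{"192.168.111.5", "fd2e:6f44:5dd8::5"}))
}
```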

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jul 26, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jul 26, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label on Aug 31, 2022
@cybertron
Member Author

The dependency in openshift/baremetal-runtimecfg#176 has merged. I believe this should be ready to go.

/test e2e-metal-ipi
/test e2e-vsphere-upgrade

@cybertron cybertron changed the title from "WIP: Dual stack vips" to "Dual stack vips" on Aug 31, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label on Aug 31, 2022
@dobsonj
Member

dobsonj commented Aug 31, 2022

To resolve the unit test failures, you just need to remove the CSIMigrationAWS: false, CSIMigrationGCE: false, and PodSecurity: true lines from both of these template files:
https://github.com/openshift/machine-config-operator/blob/master/templates/master/01-master-kubelet/_base/files/kubelet.yaml
https://github.com/openshift/machine-config-operator/blob/master/templates/worker/01-worker-kubelet/_base/files/kubelet.yaml

There are some feature gates that need to be removed in order for
the KubeletConfiguration to be valid with the new api.
@cybertron
Member Author

To resolve the unit test failures, you just need to remove the CSIMigrationAWS: false, CSIMigrationGCE: false, and PodSecurity: true lines from both of these template files: https://github.com/openshift/machine-config-operator/blob/master/templates/master/01-master-kubelet/_base/files/kubelet.yaml https://github.com/openshift/machine-config-operator/blob/master/templates/worker/01-worker-kubelet/_base/files/kubelet.yaml

Shoot, I knew that and forgot to do it on this branch. Thanks for the reminder.

@cybertron
Member Author

/test e2e-metal-ipi
/test e2e-vsphere-upgrade

The singular API and Ingress VIP fields are deprecated and cause
verification failures. The same value can be found in the first
entry in the plural VIPs version of the fields.
@nee1esh

nee1esh commented Sep 1, 2022

/retest-required

@cybertron
Member Author

/hold

This also has a dependency on openshift/installer#5798, which is approved but waiting for CI to pass.

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 1, 2022
@sinnykumari
Contributor

openshift/installer#5798 has been merged, so this should be good to go.
/test e2e-metal-ipi
/test e2e-vsphere-upgrade
/test e2e-openstack

@soltysh soltysh mentioned this pull request Sep 2, 2022
@cybertron
Member Author

/retest

We're probably good here since metal-ipi passed and the openstack and vsphere failures look unrelated, but I'd really prefer to see them pass.

@cybertron
Member Author

/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-upi

@dougsland
Contributor

LGTM, just nit comments. CI is breaking because of the dependency.

Review threads (resolved):
pkg/controller/template/render.go
pkg/controller/template/render.go (outdated)
pkg/operator/render.go (outdated)
pkg/operator/render.go (outdated)
pkg/operator/render.go (outdated)
@cybertron
Member Author

I noticed a possible issue with the vsphere logic, so I modified that to handle either nil values or zero-length arrays. Hopefully that will get those jobs passing.
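
For context on why a single length check handles both cases: in Go, len of a nil slice is 0, so one guard covers a missing field and an explicitly empty list alike. A tiny standalone snippet (not the PR's code):

```go
package main

import "fmt"

// firstOrEmpty returns the first VIP if any; len(nil) == 0 in Go, so this
// one check covers both a nil slice (field absent) and an empty list.
func firstOrEmpty(vips []string) string {
	if len(vips) > 0 {
		return vips[0]
	}
	return ""
}

func main() {
	var missing []string // nil slice
	fmt.Printf("%q\n", firstOrEmpty(missing))                   // ""
	fmt.Printf("%q\n", firstOrEmpty([]string{}))                // ""
	fmt.Printf("%q\n", firstOrEmpty([]string{"192.168.111.5"})) // "192.168.111.5"
}
```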

/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-upi
/test e2e-metal-ipi
/test e2e-vsphere-upgrade

@nee1esh

nee1esh commented Sep 3, 2022

/retest-required

@@ -26,7 +26,7 @@ contents: |
{{ .Images.baremetalRuntimeCfgImage }} \
node-ip \
set --retry-on-failure \
-{{ onPremPlatformAPIServerInternalIP . }}; \
+{{- range onPremPlatformAPIServerInternalIPs . }}{{.}} {{end}}; \
Member


@cybertron is it intended to suppress the newline here?

Member

@mandre mandre Sep 12, 2022


Clearly not, that's the error, and it's what is causing the IP to be prefixed with the \ character from the previous line.
I suggest you also add quotes around the IP address:

Suggested change:
-{{- range onPremPlatformAPIServerInternalIPs . }}{{.}} {{end}}; \
+{{range onPremPlatformAPIServerInternalIPs . }}"{{.}}" {{end}}; \

Member Author


Huh, I wonder how that ever worked. I wonder if we just got lucky and the default kubelet behavior was working even though nodeip-configuration failed? In any case, I've removed that and pushed a new version. Let's see what ci says.

This isn't needed and breaks the formatting of the service.
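
To make the whitespace-trimming behaviour concrete, here is a small standalone text/template sketch showing why the "{{-" marker glues the first VIP onto the trailing backslash of the previous line. The template strings and data are simplified placeholders, not the actual MCO unit template or the onPremPlatformAPIServerInternalIPs helper.

```go
package main

import (
	"os"
	"text/template"
)

// Cut-down stand-ins for the unit file lines; the real template ranges over
// the onPremPlatformAPIServerInternalIPs helper, replaced here by a slice.
const withTrim = "set --retry-on-failure \\\n{{- range .}}{{.}} {{end}}; \\\n"
const withoutTrim = "set --retry-on-failure \\\n{{range .}}{{.}} {{end}}; \\\n"

func main() {
	vips := []string{"192.168.111.5", "fd2e:6f44:5dd8::5"}

	// "{{-" trims the preceding newline, so the first VIP lands directly
	// after the trailing backslash and the shell line continuation breaks:
	//   set --retry-on-failure \192.168.111.5 fd2e:6f44:5dd8::5 ; \
	template.Must(template.New("trim").Parse(withTrim)).Execute(os.Stdout, vips)

	// Without the trim marker the newline survives and the backslash keeps
	// working as a line continuation:
	//   set --retry-on-failure \
	//   192.168.111.5 fd2e:6f44:5dd8::5 ; \
	template.Must(template.New("plain").Parse(withoutTrim)).Execute(os.Stdout, vips)
}
```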
Member Author

@cybertron cybertron left a comment


/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-ovn-dualstack
/test e2e-ovirt-upgrade

-return cfg.Infra.Status.PlatformStatus.VSphere.APIServerInternalIP, nil
+if len(cfg.Infra.Status.PlatformStatus.VSphere.APIServerInternalIPs) > 0 {
+return cfg.Infra.Status.PlatformStatus.VSphere.APIServerInternalIPs[0],
+nil
Member Author


Oops, didn't mean to do that. Fixed.

@@ -26,7 +26,7 @@ contents: |
{{ .Images.baremetalRuntimeCfgImage }} \
node-ip \
set --retry-on-failure \
-{{ onPremPlatformAPIServerInternalIP . }}; \
+{{- range onPremPlatformAPIServerInternalIPs . }}{{.}} {{end}}; \
Member Author


Huh, I wonder how that ever worked. I wonder if we just got lucky and the default kubelet behavior was working even though nodeip-configuration failed? In any case, I've removed that and pushed a new version. Let's see what ci says.

@sinnykumari
Contributor

e2e-metal-ipi-ovn-ipv6 test is passing with the latest changes. The latest commit only affects IPv6 changes, so I would say the failing tests are unrelated. Adding my approval; feel free to add lgtm when it looks fine to the on-prem team.
/approve

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 13, 2022
@cybertron
Member Author

I forgot to run vsphere-upi again, but since the latest change only affects IPI that should be okay. It looks like all of the metal scenarios are passing now, so I think we're good?

/retest-required

@jcpowermac
Contributor

/test e2e-vsphere-upi

@dougsland
Contributor

/retest

@mandre
Member

mandre commented Sep 14, 2022

Inspecting the latest e2e-vsphere-upi failure, we can see that the network operator entered a crash loop:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1b840e7]

goroutine 1026 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	sigs.k8s.io/controller-runtime@v0.12.0/pkg/internal/controller/controller.go:118 +0x1f4
panic({0x21a7ec0, 0x3cd09f0})
	runtime/panic.go:838 +0x207
github.com/openshift/cluster-network-operator/pkg/controller/infrastructureconfig.(*apiAndIngressVipsSynchronizer).VipsSynchronize(0x25be4dc, 0xc000df4000?)
	github.com/openshift/cluster-network-operator/pkg/controller/infrastructureconfig/sync_vips.go:34 +0x107
github.com/openshift/cluster-network-operator/pkg/controller/infrastructureconfig.(*ReconcileInfrastructureConfig).Reconcile(0xc000bf9cb0, {0x295dd10, 0xc000df2060}, {{{0x0?, 0x10?}, {0xc0004d6ed0?, 0x413f07?}}})
	github.com/openshift/cluster-network-operator/pkg/controller/infrastructureconfig/infrastructureconfig_controller.go:93 +0x368
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x295dc68?, {0x295dd10?, 0xc000df2060?}, {{{0x0?, 0x2411bc0?}, {0xc0004d6ed0?, 0x4095d4?}}})
	sigs.k8s.io/controller-runtime@v0.12.0/pkg/internal/controller/controller.go:121 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000c132c0, {0x295dc68, 0xc000bab140}, {0x228e8a0?, 0xc000bc80e0?})
	sigs.k8s.io/controller-runtime@v0.12.0/pkg/internal/controller/controller.go:320 +0x33c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000c132c0, {0x295dc68, 0xc000bab140})
	sigs.k8s.io/controller-runtime@v0.12.0/pkg/internal/controller/controller.go:273 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.12.0/pkg/internal/controller/controller.go:234 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/controller-runtime@v0.12.0/pkg/internal/controller/controller.go:230 +0x325

Meaning that we were missing Status.PlatformStatus.VSphere.APIServerInternalIPs in the cluster infrastructure object. Not sure if this is a known bug, or perhaps caused by this patch. Someone should look into it.

edit: Indeed, the must gather confirms that Status.PlatformStatus.VSphere is missing from the infrastructure object.

/retest

@creydr
Member

creydr commented Sep 14, 2022

... Indeed, the must gather confirms that Status.PlatformStatus.VSphere is missing from the infrastructure object.

In the VIP sync logic in CNO I wasn't aware of the case that vSphere UPI doesn't populate the VSphere field. I created a patch for CNO: openshift/cluster-network-operator#1558
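
The crash above has the classic shape of dereferencing an optional platform-status section without a nil check. A minimal sketch of the kind of guard such a patch would add, using hypothetical trimmed-down types rather than the actual openshift/api or CNO definitions:

```go
package main

import "fmt"

// Hypothetical stand-ins for the Infrastructure status types.
type VSpherePlatformStatus struct {
	APIServerInternalIPs []string
	IngressIPs           []string
}

type PlatformStatus struct {
	VSphere *VSpherePlatformStatus // nil on vSphere UPI clusters without VIPs
}

// syncVSphereVIPs returns early when the platform-specific section is absent
// instead of dereferencing a nil pointer.
func syncVSphereVIPs(ps *PlatformStatus) []string {
	if ps == nil || ps.VSphere == nil {
		return nil // nothing to sync; UPI clusters bring their own load balancer
	}
	return ps.VSphere.APIServerInternalIPs
}

func main() {
	upi := &PlatformStatus{} // VSphere section missing, as on vSphere UPI
	fmt.Println(syncVSphereVIPs(upi)) // [] -- no panic

	ipi := &PlatformStatus{VSphere: &VSpherePlatformStatus{
		APIServerInternalIPs: []string{"192.168.111.5"},
	}}
	fmt.Println(syncVSphereVIPs(ipi)) // [192.168.111.5]
}
```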

@jcpowermac
Contributor

jcpowermac commented Sep 14, 2022

... Indeed, the must gather confirms that Status.PlatformStatus.VSphere is missing from the infrastructure object.

In the VIP sync logic in CNO I wasn't aware of the case, that VSphere UPI doesn't populate the VSphere field. I created a patch for CNO: openshift/cluster-network-operator#1558

Right, vSphere UPI doesn't use the static pods for API lb/keepalived and ingress keepalived. The expectation is the user creates an LB before install.

@jcpowermac
Contributor

jcpowermac commented Sep 14, 2022

Meaning that we were missing Status.PlatformStatus.VSphere.APIServerInternalIPs in the cluster infrastructure object. Not sure if this is a known bug, or perhaps caused by this patch. Someone should look into it.

Not a bug, by design. When developing vSphere IPI it was expected that UPI would stay as-is and not require the API or ingress VIP.

@yuqi-zhang yuqi-zhang mentioned this pull request Sep 14, 2022
@sinnykumari sinnykumari mentioned this pull request Sep 15, 2022
@sinnykumari
Contributor

Retesting as openshift/cluster-network-operator#1558 has been merged
/test e2e-vsphere-upi

@sinnykumari
Contributor

/test e2e-vsphere-upi

@kikisdeliveryservice
Contributor

vSphere UPI job passed. @jcpowermac can this merge?

@jcpowermac
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Sep 15, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 15, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, jcpowermac, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kikisdeliveryservice
Contributor

/retest-required

@kikisdeliveryservice
Contributor

Infra failure:

level=error msg=Error: Error launching source instance: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-010489741e0d5aa99' does not exist
level=error msg=	status code: 400, request id: 92c1d8a0-b645-4212-9ddc-bb6eeac33fea
level=error
level=error msg=  with module.masters.aws_instance.master[2],
level=error msg=  on master/main.tf line 129, in resource "aws_instance" "master":
level=error msg= 129: resource "aws_instance" "master" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "cluster" stage: failed to create cluster: failed to apply Terraform: exit status 1 

Reporting and retesting:
/test e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2022

@cybertron: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
ci/prow/e2e-vsphere-upgrade | 0357c4b | false | /test e2e-vsphere-upgrade
ci/prow/e2e-ovirt-upgrade | a586894 | false | /test e2e-ovirt-upgrade
ci/prow/e2e-openstack | a586894 | false | /test e2e-openstack


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD a985910 and 2 for PR HEAD a586894 in total

@openshift-merge-robot openshift-merge-robot merged commit 8050e55 into openshift:master Sep 16, 2022
mandre added a commit to shiftstack/machine-config-operator that referenced this pull request Sep 21, 2022
The keepalived config template change for workers was missing from openshift#3269. This adds it.
Labels: approved, lgtm