Machine controller permanently moves Floating IPs over controllers #1265

Closed
zioc opened this issue Jun 17, 2022 · 10 comments · Fixed by #1276
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@zioc
Contributor

zioc commented Jun 17, 2022

/kind bug

I was attempting to use CAPI/CAPO on an OpenStack platform that doesn't provide a load balancer API, and expected to use a floating IP to reach the API server.

For that purpose I was using the following parameters in OpenStackCluster:
disableAPIServerFloatingIP: false
apiServerLoadBalancer: {}

It turns out that the openstackmachine controller keeps moving the floating IP between the control plane nodes, causing API connectivity interruptions. Here is an extract from the CAPO controller logs:

ubuntu@capo:~$ kubectl logs  -n capo-system capo-controller-manager-57dc99755b-xrpch |grep 'Associated floating'    
I0616 13:26:10.233715       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:27:41.963167       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:00.712943       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:22.729372       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:37.991424       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:39.253131       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:42.545872       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:43.691985       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:46.827933       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:47.824140       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:50.996189       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:28:51.985285       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:29:11.154365       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:29:34.674349       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 350f32a4-dc37-415a-8c5c-a2c99248f7d1" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:44:18.546255       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:44:19.245590       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:44:23.274658       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:44:23.638667       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 350f32a4-dc37-415a-8c5c-a2c99248f7d1" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:44:25.308623       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
I0616 13:44:32.747150       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e" "object"={"kind":"OpenStackCluster","namespace":"default","name":"management-cluster","uid":"26a135e7-f825-4e5a-8dd6-9c6fe6050fed","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"1460"} "reason"="SuccessfulAssociateFloatingIP"
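
To quantify the flapping in an excerpt like the one above, a small helper can count how often consecutive "Associated floating IP" events name a different port. This is only a rough sketch: the regular expression is derived from the message text shown in these logs, `countPortChanges` is a hypothetical helper name, and the sample lines in `main` are abbreviated from the excerpt.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// assocRe matches the event message text seen in the controller logs above.
var assocRe = regexp.MustCompile(`Associated floating IP (\S+) with port ([0-9a-f-]+)`)

// countPortChanges scans log text and counts how often the floating IP is
// reported on a different port than in the previous matching line. It also
// returns how many association events were seen per port.
func countPortChanges(logs string) (changes int, ports map[string]int) {
	ports = map[string]int{}
	last := ""
	sc := bufio.NewScanner(strings.NewReader(logs))
	// Controller log lines are long, so allow lines of up to 1 MiB.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		m := assocRe.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		port := m[2]
		ports[port]++
		if last != "" && port != last {
			changes++
		}
		last = port
	}
	return changes, ports
}

func main() {
	// Abbreviated versions of the first three log lines in the excerpt.
	logs := `I0616 13:26:10 events "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e"
I0616 13:27:41 events "message"="Associated floating IP 172.20.129.6 with port 1c81e094-8470-472a-bb7f-2c0337cedc7d"
I0616 13:28:00 events "message"="Associated floating IP 172.20.129.6 with port f2939e1b-e9c2-4e02-b1e9-da69087dd66e"`
	changes, ports := countPortChanges(logs)
	fmt.Printf("port changes: %d, distinct ports: %d\n", changes, len(ports))
}
```

In practice the input would be the output of the `kubectl logs` command shown above rather than an inline string.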

This issue was observed with CAPO v0.6.3 and v0.5.3 (I thought that this commit could have introduced the bug, but it didn't).

The cause of the issue is quite straightforward: as we don't check whether the floating IP is already associated with a healthy control-plane machine, each machine attaches the floating IP to its own port whenever it is reconciled.

It often even breaks deployments with multiple controllers, because moving the floating IP to the machine being provisioned races with kubeadm join, which then fails to reach the cluster API.
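
The behaviour described above can be reproduced with a toy model. The sketch below is not the CAPO code: `floatingIP`, `reconcileMachine` and `countMoves` are hypothetical names, and the port IDs are placeholders. It only illustrates that a per-machine "is the floating IP on my port?" check makes the IP move on every reconcile once there is more than one machine.

```go
package main

import "fmt"

// floatingIP is a hypothetical stand-in for the floating IP record.
type floatingIP struct {
	Address string
	PortID  string // empty when the floating IP is not attached to any port
}

// reconcileMachine mimics the reported behaviour: the machine only checks
// whether the floating IP points at its own port and takes it otherwise.
// It reports whether the floating IP was moved.
func reconcileMachine(fip *floatingIP, ownPortID string) bool {
	if fip.PortID == ownPortID {
		return false // the "already associated" case
	}
	fip.PortID = ownPortID
	return true
}

// countMoves reconciles every machine in turn for the given number of rounds
// and returns how many times the floating IP changed port.
func countMoves(fip *floatingIP, ports []string, rounds int) int {
	moves := 0
	for i := 0; i < rounds; i++ {
		for _, p := range ports {
			if reconcileMachine(fip, p) {
				moves++
			}
		}
	}
	return moves
}

func main() {
	multi := &floatingIP{Address: "172.20.129.6"}
	ports := []string{"port-a", "port-b", "port-c"}
	fmt.Println("moves with three control plane machines:", countMoves(multi, ports, 4))

	single := &floatingIP{Address: "172.20.129.6"}
	fmt.Println("moves with one control plane machine:", countMoves(single, []string{"port-a"}, 4))
}
```

With a single machine the check settles after the first association; with several machines every reconcile of a non-holder moves the IP again.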

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 17, 2022
@jichenjc
Contributor

we have this logic

which means the floating IP's port ID is checked against the machine's port ID.
Do you mind searching for this log line and sharing what it printed?

s.scope.Logger.Info("Associating floating IP", "id", fp.ID, "ip", fp.FloatingIP)

@jichenjc
Contributor

with my settings

I can see our code logic above takes effect

  apiServerLoadBalancer:
    enabled: false

I0620 09:49:36.497758       1 floatingip.go:123] controller/openstackmachine "msg"="Associating floating IP" "cluster"="capi-quickstart" "machine"="capi-quickstart-control-plane-cw7ql" "name"="capi-quickstart-control-plane-fndtx" "namespace"="default" "openStackCluster"="capi-quickstart" "openStackMachine"="capi-quickstart-control-plane-fndtx" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="9d2ae59e-8bda-41ec-99eb-517a18b5814a" "ip"="172.24.4.60"
I0620 09:49:36.497855       1 openstackmachine_controller.go:391] controller/openstackmachine "msg"="Reconciled Machine create successfully" "cluster"="capi-quickstart" "machine"="capi-quickstart-control-plane-cw7ql" "name"="capi-quickstart-control-plane-fndtx" "namespace"="default" "openStackCluster"="capi-quickstart" "openStackMachine"="capi-quickstart-control-plane-fndtx" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine"
I0620 09:49:36.498059       1 recorder.go:103] events "msg"="Normal"  "message"="Floating IP 172.24.4.60 already associated with port 23b5b3df-e6bf-4c39-a767-97c108d9005b" "object"={"kind":"OpenStackMachine","namespace":"default","name":"capi-quickstart-control-plane-fndtx","uid":"8cc30528-2f31-4b6b-b193-5676173a4bd4","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"1287"} "reason"="Successfulassociatefloatingip"

@zioc
Contributor Author

zioc commented Jun 20, 2022

I may have misunderstood, but from what I see in the current controller code, as soon as a control plane machine is reconciled, the controller attempts to associate the floating IP with the machine being reconciled.

With a single controller, it works properly, as the condition you pointed to is always true. But as soon as another controller VM is reconciled, it sees that the floating IP isn't bound to its own port and performs the association. Here is a complete extract of the logs of my CAPO controller:

I0620 12:06:27.690227       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.130 with port 904f4da7-2fa2-445a-bc8c-857be9bae322" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-9wjjt","uid":"caccbe1e-20b8-40c7-a6ce-8a8d79b7ccca","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69398"} "reason"="Successfulassociatefloatingip"
I0620 12:08:24.809144       1 openstackmachine_controller.go:300] controller/openstackmachine "msg"="Reconciling Machine" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:24.826477       1 openstackmachine_controller.go:300] controller/openstackmachine "msg"="Reconciling Machine" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:24.901090       1 openstackmachine_controller.go:300] controller/openstackmachine "msg"="Reconciling Machine" "cluster"="management-cluster" "machine"="management-cluster-control-plane-fjgkk" "name"="management-cluster-control-plane-stckm" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-stckm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:25.291480       1 openstackmachine_controller.go:345] controller/openstackmachine "msg"="Machine instance is ACTIVE" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "instance-id"="52d995c6-93f6-4d7d-84e2-72f2adbdfef2"
I0620 12:08:25.322556       1 openstackmachine_controller.go:345] controller/openstackmachine "msg"="Machine instance is ACTIVE" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "instance-id"="fd2d7f67-db4f-44bb-85c8-7d06dbfaf02e"
I0620 12:08:25.359021       1 openstackmachine_controller.go:345] controller/openstackmachine "msg"="Machine instance is ACTIVE" "cluster"="management-cluster" "machine"="management-cluster-control-plane-fjgkk" "name"="management-cluster-control-plane-stckm" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-stckm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "instance-id"="984bde3c-b86b-4725-bad2-5121eccb31eb"
I0620 12:08:25.558501       1 floatingip.go:123] controller/openstackmachine "msg"="Associating floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "ip"="172.20.129.130"
I0620 12:08:25.596820       1 floatingip.go:123] controller/openstackmachine "msg"="Associating floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "ip"="172.20.129.130"
I0620 12:08:25.598088       1 recorder.go:103] events "msg"="Normal"  "message"="Floating IP 172.20.129.130 already associated with port 904f4da7-2fa2-445a-bc8c-857be9bae322" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-9wjjt","uid":"caccbe1e-20b8-40c7-a6ce-8a8d79b7ccca","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69398"} "reason"="Successfulassociatefloatingip"
I0620 12:08:25.598594       1 openstackmachine_controller.go:391] controller/openstackmachine "msg"="Reconciled Machine create successfully" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:25.659656       1 floatingip.go:123] controller/openstackmachine "msg"="Associating floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-fjgkk" "name"="management-cluster-control-plane-stckm" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-stckm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "ip"="172.20.129.130"
I0620 12:08:26.177295       1 openstackmachine_controller.go:300] controller/openstackmachine "msg"="Reconciling Machine" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:26.592385       1 openstackmachine_controller.go:345] controller/openstackmachine "msg"="Machine instance is ACTIVE" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "instance-id"="fd2d7f67-db4f-44bb-85c8-7d06dbfaf02e"
I0620 12:08:26.849445       1 floatingip.go:123] controller/openstackmachine "msg"="Associating floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "ip"="172.20.129.130"
I0620 12:08:28.412020       1 floatingip.go:181] controller/openstackmachine "msg"="Waiting for floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "targetStatus"="ACTIVE"
I0620 12:08:28.456971       1 openstackmachine_controller.go:391] controller/openstackmachine "msg"="Reconciled Machine create successfully" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:28.457306       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.130 with port c06d85b4-776c-45ea-aff3-8a1a569d101c" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-zmsnn","uid":"4929f04f-e781-4b20-9ccc-80f0e6e98266","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69382"} "reason"="Successfulassociatefloatingip"
I0620 12:08:28.987837       1 floatingip.go:181] controller/openstackmachine "msg"="Waiting for floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-fjgkk" "name"="management-cluster-control-plane-stckm" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-stckm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "targetStatus"="ACTIVE"
I0620 12:08:29.026182       1 openstackmachine_controller.go:391] controller/openstackmachine "msg"="Reconciled Machine create successfully" "cluster"="management-cluster" "machine"="management-cluster-control-plane-fjgkk" "name"="management-cluster-control-plane-stckm" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-stckm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:29.026898       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.130 with port 4b4829cb-0ddd-4c46-ac90-c0f8e251499c" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-stckm","uid":"7f86ecf3-9379-4617-9eb2-1b8be9981f08","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"3686"} "reason"="Successfulassociatefloatingip"
I0620 12:08:29.032908       1 openstackmachine_controller.go:300] controller/openstackmachine "msg"="Reconciling Machine" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:29.534596       1 openstackmachine_controller.go:345] controller/openstackmachine "msg"="Machine instance is ACTIVE" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "instance-id"="52d995c6-93f6-4d7d-84e2-72f2adbdfef2"
I0620 12:08:29.903935       1 floatingip.go:181] controller/openstackmachine "msg"="Waiting for floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "targetStatus"="ACTIVE"
I0620 12:08:29.922421       1 floatingip.go:123] controller/openstackmachine "msg"="Associating floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "ip"="172.20.129.130"
I0620 12:08:29.956696       1 openstackmachine_controller.go:391] controller/openstackmachine "msg"="Reconciled Machine create successfully" "cluster"="management-cluster" "machine"="management-cluster-control-plane-dcdck" "name"="management-cluster-control-plane-9wjjt" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-9wjjt" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:29.956788       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.130 with port 904f4da7-2fa2-445a-bc8c-857be9bae322" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-9wjjt","uid":"caccbe1e-20b8-40c7-a6ce-8a8d79b7ccca","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69869"} "reason"="Successfulassociatefloatingip"
I0620 12:08:32.827251       1 floatingip.go:181] controller/openstackmachine "msg"="Waiting for floating IP" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "id"="f8a26655-c5ec-4ec9-aaf7-b5845275ae3e" "targetStatus"="ACTIVE"
I0620 12:08:32.875912       1 openstackmachine_controller.go:391] controller/openstackmachine "msg"="Reconciled Machine create successfully" "cluster"="management-cluster" "machine"="management-cluster-control-plane-pfxgs" "name"="management-cluster-control-plane-zmsnn" "namespace"="default" "openStackCluster"="management-cluster" "openStackMachine"="management-cluster-control-plane-zmsnn" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" 
I0620 12:08:32.876760       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.130 with port c06d85b4-776c-45ea-aff3-8a1a569d101c" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-zmsnn","uid":"4929f04f-e781-4b20-9ccc-80f0e6e98266","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69884"} "reason"="Successfulassociatefloatingip"

At the beginning we see that the floating IP is associated with machine management-cluster-control-plane-9wjjt:

I0620 12:06:27.690227       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.130 with port 904f4da7-2fa2-445a-bc8c-857be9bae322" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-9wjjt","uid":"caccbe1e-20b8-40c7-a6ce-8a8d79b7ccca","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69398"} "reason"="Successfulassociatefloatingip"

And indeed, if that machine is reconciled again, it reports that the floating IP is already associated:

I0620 12:08:25.598088       1 recorder.go:103] events "msg"="Normal"  "message"="Floating IP 172.20.129.130 already associated with port 904f4da7-2fa2-445a-bc8c-857be9bae322" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-9wjjt","uid":"caccbe1e-20b8-40c7-a6ce-8a8d79b7ccca","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69398"} "reason"="Successfulassociatefloatingip"

But as soon as another machine (here management-cluster-control-plane-zmsnn) gets reconciled, it associates the floating IP with itself, because it has a different port ID:

I0620 12:08:28.457306       1 recorder.go:103] events "msg"="Normal"  "message"="Associated floating IP 172.20.129.130 with port c06d85b4-776c-45ea-aff3-8a1a569d101c" "object"={"kind":"OpenStackMachine","namespace":"default","name":"management-cluster-control-plane-zmsnn","uid":"4929f04f-e781-4b20-9ccc-80f0e6e98266","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha5","resourceVersion":"69382"} "reason"="Successfulassociatefloatingip"

IMHO, it'd be better to test if fp.PortID != nil here in order to check whether the floating IP has already been associated with a control-plane node.
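
A minimal sketch of the suggested guard follows. It is an illustration, not the actual CAPO change: `floatingIP` and `associateIfUnattached` are hypothetical, and the sketch models the port ID as a plain string, so "unattached" is an empty-string comparison rather than a nil check. Note that it only tests whether some port holds the floating IP, not whether that port belongs to a healthy control-plane machine.

```go
package main

import "fmt"

// floatingIP is a hypothetical stand-in for the floating IP record.
type floatingIP struct {
	Address string
	PortID  string // empty when the floating IP is not attached to any port
}

// associateIfUnattached attaches the floating IP to ownPortID only when no
// port currently holds it. It reports whether an association was performed.
func associateIfUnattached(fip *floatingIP, ownPortID string) bool {
	if fip.PortID != "" {
		return false // some port already holds the floating IP: leave it alone
	}
	fip.PortID = ownPortID
	return true
}

func main() {
	fip := &floatingIP{Address: "172.20.129.130"}
	// Placeholder port names, one per control plane machine.
	for _, port := range []string{"port-9wjjt", "port-zmsnn", "port-stckm"} {
		moved := associateIfUnattached(fip, port)
		fmt.Printf("reconcile %s: associated=%t holder=%s\n", port, moved, fip.PortID)
	}
}
```

With this guard the first machine to reconcile keeps the floating IP and later reconciles of other machines leave it in place.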

@mdbooth
Contributor

mdbooth commented Jun 20, 2022

It sounds like we've got a problem of expressiveness here. We expect all the control plane nodes to be the same and deterministic, but that isn't the case when you've got multiple control plane nodes with only one of them having a floating ip. With our current model we should probably be restricting the control plane to a single node when not using a load balancer, because it looks like we can't support multiple nodes. I don't think we can use the if fp.PortID != nil check because this would make the configuration of the control plane heterogeneous and non-deterministic. For example, during an upgrade you might:

  • Roll out a new control plane node (doesn't get FIP because it's already attached)
  • Remove existing control plane node, happens to be the one with the attached FIP
    At this point the FIP isn't attached anywhere and the cluster is down until you fix it manually.
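
The upgrade scenario above can be walked through with a toy model. Everything here is hypothetical (`cluster`, `addMachine`, `removeMachine`, `reconcile`), and the sketch assumes that deleting a machine's port leaves the floating IP detached. It shows that, with an "attach only when unattached" guard, the floating IP stays orphaned until something reconciles a remaining machine.

```go
package main

import "fmt"

// floatingIP is a hypothetical stand-in for the floating IP record.
type floatingIP struct {
	PortID string // empty when the floating IP is not attached to any port
}

// cluster is a toy model of a control plane sharing one floating IP.
type cluster struct {
	fip   floatingIP
	ports map[string]bool // ports of the existing control plane machines
}

// reconcile applies the "attach only when unattached" guard for one machine.
func (c *cluster) reconcile(port string) {
	if c.ports[port] && c.fip.PortID == "" {
		c.fip.PortID = port
	}
}

// addMachine creates a machine and reconciles it once, as during provisioning.
func (c *cluster) addMachine(port string) {
	c.ports[port] = true
	c.reconcile(port)
}

// removeMachine deletes the machine's port. The sketch assumes this detaches
// the floating IP when that port was holding it.
func (c *cluster) removeMachine(port string) {
	delete(c.ports, port)
	if c.fip.PortID == port {
		c.fip.PortID = ""
	}
}

func main() {
	c := &cluster{ports: map[string]bool{}}
	c.addMachine("old-node") // gets the floating IP
	c.addMachine("new-node") // skipped: the floating IP is already attached
	c.removeMachine("old-node")
	fmt.Printf("after rollout: holder=%q\n", c.fip.PortID)

	// Access is only restored if something triggers another reconcile.
	c.reconcile("new-node")
	fmt.Printf("after next reconcile: holder=%q\n", c.fip.PortID)
}
```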

I do think you have a valid use case, though. In fact we have this same use case in OpenShift: we don't use Octavia for the API vip either. Instead we create a FIP with a port on the control plane network which isn't bound to any of the control plane machines. We then use keepalived to float the VIP dynamically across all 3 control plane nodes. We're also currently looking at adding more api VIP options, including floating the VIP between control plane nodes which don't share an L2 using BGP and ECMP. I would personally be very much in favour of also adding these options to CAPO, but I'm not sure who would work on them. Is this something you might be interested in?

@zioc
Contributor Author

zioc commented Jun 20, 2022

I agree that having a single floating IP pointing to a specific control plane node is definitely not a production-grade solution, and VRRP or BGP based options would be much better! For that purpose, I was wondering whether MetalLB could be deployed as a ClusterResourceSet and used for external API access? I haven't tried that option yet, but I'd like to give it a try if it makes sense.

Regarding the floating IP issue: when we get to the point where the FIP isn't attached anywhere and the cluster is down until you fix it manually, wouldn't we have fp.PortID == nil at that time (or maybe fp.ID != "ACTIVE")? With the suggested fix, the FIP would then be bound to another controller node.

But even if that works, it's true that external API access failover would rely on the machine reconciliation loop, which is not fast, so explicitly limiting the control plane to a single node as you suggested would also make sense, for sure.

@mdbooth
Contributor

mdbooth commented Jun 20, 2022

I agree that having a single floating IP pointing to a specific control plane node is definitely not a production-grade solution, and VRRP- or BGP-based options would be much better! For that purpose, I was wondering whether MetalLB could be deployed as a ClusterResourceSet and used for external API access. I haven't tried that option yet, but I'd like to give it a try if it makes sense.

Unfortunately not directly, as far as I'm aware. IIUC MetalLB provides service load balancers, i.e. it requires an apiserver to operate and therefore can't load-balance the API VIP. I'd be delighted to be wrong about that, though, so if you know better please chime in!

Regarding the floating IP issue: when we get to the point where "the FIP isn't attached anywhere and the cluster is down until you fix it manually", wouldn't we have fp.PortID == nil at that point (or maybe fp.ID != "ACTIVE"), so that the FIP would be bound to another controller node by the suggested fix?

...

But even if that works, it's true that external API access failover would rely on the machine reconciliation loop, which is not fast, so explicitly limiting the control plane to a single node as you suggested would also make sense, for sure.

That's the problem. You'd need something to trigger the machine reconciliation loop and, in ideal circumstances, nothing will do that on a static cluster.

@jichenjc
Contributor

jichenjc commented Jun 22, 2022

we should probably be restricting the control plane to a single node when not using a load balancer,

I think we should add a check and a warning for this case (multiple control plane nodes with multiple floating IPs) for now.
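A minimal sketch of such a check, assuming the replica count is already known to the caller. The function name and parameters are hypothetical and only mirror the `apiServerLoadBalancer` and `disableAPIServerFloatingIP` settings quoted in the issue; this is not existing CAPO validation code.

```go
package main

import "fmt"

// apiEndpointWarning returns a non-empty warning when a single floating IP,
// without a load balancer, fronts more than one control plane node.
// All parameters are hypothetical inputs for this illustration.
func apiEndpointWarning(replicas int, lbEnabled, fipDisabled bool) string {
	if lbEnabled || fipDisabled || replicas <= 1 {
		return ""
	}
	return fmt.Sprintf(
		"API server floating IP is attached to one of %d control plane nodes without a load balancer; it does not provide HA",
		replicas,
	)
}

func main() {
	if w := apiEndpointWarning(3, false, false); w != "" {
		fmt.Println("warning:", w)
	}
}
```

Returning a message rather than an error keeps this a warning and not a validation constraint.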

@zioc
Contributor Author

zioc commented Jun 22, 2022

That makes sense. Anyway, if it's only a warning (and not a validation constraint), shouldn't we also try to avoid moving the floating IP around the controllers when there are several of them, as suggested?

Sorry if I misunderstand, but I don't get your point about multiple floating IPs: there can be only one floating IP that carries the API endpoint, right?

That's the problem. You'd need something to trigger the machine reconciliation loop and, in ideal circumstances, nothing will do that on a static cluster.

Not arguing that it would be a suitable failover mechanism, but from what I understand there's a periodic reconciliation every 10 minutes by default. So even on a static cluster, each machine would end up being reconciled at some point. (By the way, this is also one of the causes of this bug, as the FIP will never stop being moved between controllers following their respective periodic reconciliations.)

@jichenjc
Contributor

Not arguing that it would be a suitable failover mechanism, but from what I understand there's a periodic reconciliation every 10 minutes by default. So even on a static cluster, each machine would end up being reconciled at some point. (By the way, this is also one of the causes of this bug, as the FIP will never stop being moved between controllers following their respective periodic reconciliations.)

this is correct

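+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| ID                                   | Name                                | Status | Task State | Power State | Networks                                                               |
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| 37bb69a6-4185-4d8a-9b92-64ca4e5c9091 | capi-quickstart-control-plane-22w2z | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.74, 172.24.4.219 |
| 087302c3-c17b-49a4-b360-a1e4fab1bc97 | capi-quickstart-control-plane-28wl8 | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.28               |
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+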

I noticed that the FIP suddenly moved to another node, which does not look like correct behavior to me:

+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| ID                                   | Name                                | Status | Task State | Power State | Networks                                                               |
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| 37bb69a6-4185-4d8a-9b92-64ca4e5c9091 | capi-quickstart-control-plane-22w2z | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.74               |
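| 087302c3-c17b-49a4-b360-a1e4fab1bc97 | capi-quickstart-control-plane-28wl8 | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.28, 172.24.4.219 |
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+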

So I think that unless we can figure out an external solution such as a VIP, we'd better disallow more than one control plane node when no LB is involved.

@jichenjc
Contributor

The number of control plane replicas is defined in KubeadmControlPlane.
Technically we can get it, but adding such a restriction doesn't seem to bring much value.

So I think it's more reasonable to just stick the FIP to the first VM (i.e. the fp != nil check)
and warn users that they need to use an LB, because a floating IP alone can't guarantee HA of the control plane.
Of course, we could reassign the floating IP when MHC finds that the node holding the FIP is unhealthy.
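The "sticky FIP, reassign only when the holder is unhealthy" idea could look roughly like the following Go sketch. The `machine` type, its `healthy` flag (standing in for a MachineHealthCheck verdict) and `chooseFIPPort` are hypothetical names for illustration, not CAPO or Cluster API types.

```go
package main

import "fmt"

// machine is a hypothetical view of a control plane machine. healthy stands
// in for whatever a health check would report.
type machine struct {
	name    string
	portID  string
	healthy bool
}

// chooseFIPPort returns the port that should hold the API floating IP.
// The current holder keeps the FIP while it is healthy; otherwise the first
// healthy machine takes over. It returns "" when no machine is healthy.
func chooseFIPPort(currentPort string, machines []machine) string {
	for _, m := range machines {
		if currentPort != "" && m.portID == currentPort && m.healthy {
			return currentPort
		}
	}
	for _, m := range machines {
		if m.healthy {
			return m.portID
		}
	}
	return ""
}

func main() {
	machines := []machine{
		{name: "cp-0", portID: "port-a", healthy: true},
		{name: "cp-1", portID: "port-b", healthy: true},
	}
	holder := chooseFIPPort("", machines)
	fmt.Println("initial holder:", holder)

	// Reconciling cp-1 does not move the FIP while cp-0 is healthy.
	fmt.Println("after resync:", chooseFIPPort(holder, machines))

	// Once cp-0 is reported unhealthy, the FIP moves to cp-1.
	machines[0].healthy = false
	fmt.Println("after failure:", chooseFIPPort(holder, machines))
}
```

Failover in this scheme is still only as fast as whatever triggers the next reconciliation, so the warning about using an LB for HA remains relevant.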
