No Control Plane machines came into existence. #10356

Open

adilGhaffarDev opened this issue Apr 2, 2024 · 10 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@adilGhaffarDev
Contributor

Which jobs are flaking?

  • periodic-cluster-api-e2e-main
  • periodic-cluster-api-e2e-mink8s-main
  • periodic-cluster-api-e2e-dualstack-and-ipv6-release-1-6
  • periodic-cluster-api-e2e-release-1-4

Which tests are flaking?

  • When following the Cluster API quick-start with Ignition Should create a workload cluster
  • When upgrading a workload cluster using ClusterClass with a HA control plane [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest
  • When testing MachineDeployment scale out/in Should successfully scale a MachineDeployment up and down upon changes to the MachineDeployment replica count
  • When testing clusterctl upgrades using ClusterClass (v1.5=>current) [ClusterClass] Should create a management cluster and then upgrade all the providers
  • When testing ClusterClass changes [ClusterClass] Should successfully rollout the managed topology upon changes to the ClusterClass

Since when has it been flaking?

Minor flakes with this error have been happening for a long time.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 2, 2024
@sbueringer
Member

/triage accepted

Thx for reporting

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 3, 2024
@sbueringer
Member

It would be good to add a link either to a specific failed job or to k8s-triage filtered down on this failure.

Just to make it easier to find a failed job.

@fabriziopandini
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024
@chrischdi
Member

k8s-triage link

@chrischdi
Member

I did hit this issue in a local setup. However, it's quite hard to triage, because the machine had already been replaced by a new one (I guess because of MHC doing its thing) and the cluster then started successfully.

I still have the setup around if there are ideas on what information to look for.
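For anyone trying to triage this locally, a quick way to check whether MHC remediation replaced the machine is to look at the MachineHealthCheck and Machine objects (generic kubectl commands, not specific to this setup):

```bash
# List MachineHealthChecks and Machines, sorted by creation time, to spot recent replacements.
kubectl get machinehealthchecks -A
kubectl get machines -A --sort-by=.metadata.creationTimestamp
# Events and conditions on a replaced Machine usually show the remediation reason.
kubectl describe machine <machine-name> -n <namespace>
```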

@chrischdi
Member

chrischdi commented Apr 17, 2024

I was able to hit it again and triage a bit.

It turns out that the node itself came up fine, except for the parts that try to reach the load-balanced control-plane endpoint.

TLDR: The haproxy load balancer did not forward traffic to the control plane node.

My current theory is:

  • CAPD wrote the new haproxy config file to the lb container, which includes the first control plane node as a backend.
    • I can confirm that the file was correct in my setup by reading it back from the container (e.g. via `docker cp`).
  • CAPD signalled haproxy to reload the config (by sending SIGHUP).
  • Afterwards haproxy still did not route requests to the running node.

I was able to "fix" the issue in this case by again sending SIGHUP to haproxy: docker kill -s SIGHUP <container>
Afterwards haproxy did route requests to the running node.
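A manual version of this check/workaround looks roughly like the following (the container name pattern and config path are assumptions based on CAPD/haproxy defaults, not taken from the actual code):

```bash
# Dump the haproxy config that is actually present inside the load balancer container
# (CAPD's lb container is typically named <cluster-name>-lb; the path may differ).
docker exec <cluster-name>-lb cat /usr/local/etc/haproxy/haproxy.cfg

# If the expected control plane backend is listed but traffic is still not routed,
# force haproxy to reload its config by sending SIGHUP again.
docker kill -s SIGHUP <cluster-name>-lb
```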

I'm currently testing the following fix locally, which is: reading back and comparing the config file in CAPD after writing it and before reloading haproxy:


Test setup:

GINKGO_FOCUS="PR-Blocking"
GINKGO_SKIP="\[Conformance\]"

So it only runs a single test.
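For reference, the same selection can be run directly with something like the following, assuming the standard cluster-api `make test-e2e` target picks these variables up (the runs below went through the prowjob setup instead):

```bash
# Run only the PR-Blocking e2e tests and skip conformance.
GINKGO_FOCUS="PR-Blocking" GINKGO_SKIP="\[Conformance\]" make test-e2e
```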

I used prowjob on kind to create a kind cluster and a pod YAML, which I then modified (adjusted timeouts + GINKGO_FOCUS + requests, probably other things too).

I then ran the loop using `./scripts/test-pj.sh`.
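A stripped-down version of such a retry loop might look like this (pod name and YAML file are hypothetical; the actual scripts/test-pj.sh in the commit below does more):

```bash
# Keep re-running the (modified) prowjob pod until the flake reproduces.
for i in $(seq 1 30); do
  kubectl apply -f pod.yaml
  kubectl wait --for=condition=Ready pod/e2e --timeout=15m
  kubectl logs -f e2e | tee "run-${i}.log"
  # Stop and keep the environment around once the flake is hit.
  grep -q "Timed out waiting for all Machines to exist" "run-${i}.log" && break
  kubectl delete -f pod.yaml --wait
done
```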

All code changes for my setup are here for reference: dfe9d5e

I did some optimisations, like packing all required images into scripts/images.tar and loading them from there instead of building them, plus probably some others, to make it faster and less dependent on the internet (to avoid running into rate limiting).

@chrischdi
Member

Fixes are merged; let's check in a week or so whether the error occurs again.

@chrischdi
Member

The merged fix did not help.

@chrischdi
Member

For reference, I did hit the same issue (CAPD load balancer config not active) as described in this comment on a 0.4 => 1.6 => current upgrade test but with a slightly different log:

  STEP: Initializing the workload cluster with older versions of providers @ 04/24/24 08:03:54.723
  INFO: clusterctl init --config /logs/artifacts/repository/clusterctl-config.v1.2.yaml --kubeconfig /tmp/e2e-kubeconfig2940028556 --wait-providers --core cluster-api:v0.4.8 --bootstrap kubeadm:v0.4.8 --control-plane kubeadm:v0.4.8 --infrastructure docker:v0.4.8
  INFO: Waiting for provider controllers to be running
  STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 04/24/24 08:04:31.768
  INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-7cb759f76b-whdwb, container manager
  STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 04/24/24 08:04:31.892
  INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-b67d5f4cb-8kppj, container manager
  STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 04/24/24 08:04:31.918
  INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-69846d766d-k8ptj, container manager
  STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 04/24/24 08:04:31.95
  INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-7c9ccb586-5mkbx, container manager
  STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 04/24/24 08:04:32.176
  STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 04/24/24 08:04:32.177
  INFO: Creating namespace clusterctl-upgrade
  INFO: Creating event watcher for namespace "clusterctl-upgrade"
  STEP: Creating a test workload cluster @ 04/24/24 08:04:32.193
  INFO: Creating the workload cluster with name "clusterctl-upgrade-o3zf09" using the "(default)" template (Kubernetes v1.23.17, 1 control-plane machines, 1 worker machines)
  INFO: Getting the cluster template yaml
  INFO: clusterctl config cluster clusterctl-upgrade-o3zf09 --infrastructure docker --kubernetes-version v1.23.17 --control-plane-machine-count 1 --worker-machine-count 1 --flavor (default)
  INFO: Applying the cluster template yaml to the cluster
  STEP: Waiting for the machines to exist @ 04/24/24 08:04:44.941
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
  STEP: Dumping logs from the "clusterctl-upgrade-hs16jr" workload cluster @ 04/24/24 08:09:44.953
  [FAILED] in [AfterEach] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_proxy.go:311 @ 04/24/24 08:12:44.955
  << Timeline

  [FAILED] Timed out after 300.001s.
  Timed out waiting for all Machines to exist
  Expected
      <int64>: 0
  to equal
      <int64>: 2
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953

  Full Stack Trace
    sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
        /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 +0x28cd

@fabriziopandini fabriziopandini added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 6, 2024
@pravarag
Contributor

I'll investigate more on this issue.
/assign
