No Control Plane machines came into existence. #10356

Open

adilGhaffarDev opened this issue Apr 2, 2024 · 10 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@adilGhaffarDev
Contributor

Which jobs are flaking?

  • periodic-cluster-api-e2e-main
  • periodic-cluster-api-e2e-mink8s-main
  • periodic-cluster-api-e2e-dualstack-and-ipv6-release-1-6
  • periodic-cluster-api-e2e-release-1-4

Which tests are flaking?

  • When following the Cluster API quick-start with Ignition Should create a workload cluster
  • When upgrading a workload cluster using ClusterClass with a HA control plane [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest
  • When testing MachineDeployment scale out/in Should successfully scale a MachineDeployment up and down upon changes to the MachineDeployment replica count
  • When testing clusterctl upgrades using ClusterClass (v1.5=>current) [ClusterClass] Should create a management cluster and then upgrade all the providers
  • When testing ClusterClass changes [ClusterClass] Should successfully rollout the managed topology upon changes to the ClusterClass

Since when has it been flaking?

Minor flakes with this error have been happening for a long time.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 2, 2024
@sbueringer
Member

/triage accepted

Thx for reporting

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 3, 2024
@sbueringer
Member

It would be good to add a link either to a specific failed job or to k8s-triage filtered down on this failure.

Just to make it easier to find a failed job.

@fabriziopandini
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024
@chrischdi
Member

k8s-triage link

@chrischdi
Member

I did hit this issue in a local setup. However, it's quite hard to triage, because the machine had already been replaced by a new one (I guess because of MHC doing its thing) and the cluster then started successfully.

I still have the setup around if there are ideas on what information to look for.
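For anyone trying to triage this locally, a quick way to check whether MHC remediation replaced the machine is to look at the MachineHealthCheck and Machine objects (generic kubectl commands, not specific to this setup):

```bash
# List MachineHealthChecks and Machines, sorted by creation time, to spot recent replacements.
kubectl get machinehealthchecks -A
kubectl get machines -A --sort-by=.metadata.creationTimestamp
# Events and conditions on a replaced Machine usually show the remediation reason.
kubectl describe machine <machine-name> -n <namespace>
```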

@chrischdi
Member

chrischdi commented Apr 17, 2024

I was able to hit it again and triage a bit.

It turns out that the node itself came up fine, except for the parts that try to reach the load-balanced control-plane endpoint.

TLDR: The haproxy load balancer did not forward traffic to the control plane node.

My current theory is:

  • CAPD wrote the new haproxy config file to the lb container, which includes the first control plane node as a backend.
    • I can confirm that the file was correct in my setup by reading it back from the container (e.g. via `docker cp`).
  • CAPD signalled haproxy to reload the config (by sending SIGHUP).
  • Afterwards haproxy still did not route requests to the running node.

I was able to "fix" the issue in this case by again sending SIGHUP to haproxy: docker kill -s SIGHUP <container>
Afterwards haproxy did route requests to the running node.
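A manual version of this check/workaround looks roughly like the following (the container name pattern and config path are assumptions based on CAPD/haproxy defaults, not taken from the actual code):

```bash
# Dump the haproxy config that is actually present inside the load balancer container
# (CAPD's lb container is typically named <cluster-name>-lb; the path may differ).
docker exec <cluster-name>-lb cat /usr/local/etc/haproxy/haproxy.cfg

# If the expected control plane backend is listed but traffic is still not routed,
# force haproxy to reload its config by sending SIGHUP again.
docker kill -s SIGHUP <cluster-name>-lb
```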

I'm currently testing the following fix locally, which is: reading back and comparing the config file in CAPD after writing it and before reloading haproxy:


Test setup:

GINKGO_FOCUS="PR-Blocking"
GINKGO_SKIP="\[Conformance\]"

So it only runs a single test.
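For reference, the same selection can be run directly with something like the following, assuming the standard cluster-api `make test-e2e` target picks these variables up (the runs below went through the prowjob setup instead):

```bash
# Run only the PR-Blocking e2e tests and skip conformance.
GINKGO_FOCUS="PR-Blocking" GINKGO_SKIP="\[Conformance\]" make test-e2e
```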

I used prowjob on kind to create a kind cluster and a pod YAML, which I then modified (adjusted timeouts + GINKGO_FOCUS + requests, probably other things too).

I then ran the loop using `./scripts/test-pj.sh`.
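A stripped-down version of such a retry loop might look like this (pod name and YAML file are hypothetical; the actual scripts/test-pj.sh in the commit below does more):

```bash
# Keep re-running the (modified) prowjob pod until the flake reproduces.
for i in $(seq 1 30); do
  kubectl apply -f pod.yaml
  kubectl wait --for=condition=Ready pod/e2e --timeout=15m
  kubectl logs -f e2e | tee "run-${i}.log"
  # Stop and keep the environment around once the flake is hit.
  grep -q "Timed out waiting for all Machines to exist" "run-${i}.log" && break
  kubectl delete -f pod.yaml --wait
done
```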

All code changes for my setup are here for reference: dfe9d5e

I did some optimisations, like packing all required images into scripts/images.tar and loading them from there instead of building them, plus probably some others, to make it faster and less dependent on the internet (to avoid running into rate limiting).

@chrischdi
Member

Fixes are merged; let's check in a week or so whether the error occurs again.

@chrischdi
Member

The merged fix did not help.

@chrischdi
Member

For reference, I did hit the same issue (CAPD load balancer config not active) as described in this comment on a 0.4 => 1.6 => current upgrade test but with a slightly different log:

  STEP: Initializing the workload cluster with older versions of providers @ 04/24/24 08:03:54.723
  INFO: clusterctl init --config /logs/artifacts/repository/clusterctl-config.v1.2.yaml --kubeconfig /tmp/e2e-kubeconfig2940028556 --wait-providers --core cluster-api:v0.4.8 --bootstrap kubeadm:v0.4.8 --control-plane kubeadm:v0.4.8 --infrastructure docker:v0.4.8
  INFO: Waiting for provider controllers to be running
  STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 04/24/24 08:04:31.768
  INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-7cb759f76b-whdwb, container manager
  STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 04/24/24 08:04:31.892
  INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-b67d5f4cb-8kppj, container manager
  STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 04/24/24 08:04:31.918
  INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-69846d766d-k8ptj, container manager
  STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 04/24/24 08:04:31.95
  INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-7c9ccb586-5mkbx, container manager
  STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 04/24/24 08:04:32.176
  STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 04/24/24 08:04:32.177
  INFO: Creating namespace clusterctl-upgrade
  INFO: Creating event watcher for namespace "clusterctl-upgrade"
  STEP: Creating a test workload cluster @ 04/24/24 08:04:32.193
  INFO: Creating the workload cluster with name "clusterctl-upgrade-o3zf09" using the "(default)" template (Kubernetes v1.23.17, 1 control-plane machines, 1 worker machines)
  INFO: Getting the cluster template yaml
  INFO: clusterctl config cluster clusterctl-upgrade-o3zf09 --infrastructure docker --kubernetes-version v1.23.17 --control-plane-machine-count 1 --worker-machine-count 1 --flavor (default)
  INFO: Applying the cluster template yaml to the cluster
  STEP: Waiting for the machines to exist @ 04/24/24 08:04:44.941
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
  STEP: Dumping logs from the "clusterctl-upgrade-hs16jr" workload cluster @ 04/24/24 08:09:44.953
  [FAILED] in [AfterEach] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_proxy.go:311 @ 04/24/24 08:12:44.955
  << Timeline

  [FAILED] Timed out after 300.001s.
  Timed out waiting for all Machines to exist
  Expected
      <int64>: 0
  to equal
      <int64>: 2
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953

  Full Stack Trace
    sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
        /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 +0x28cd

@fabriziopandini fabriziopandini added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 6, 2024
@pravarag
Contributor

I'll investigate more on this issue.
/assign
