
[DO_NOT_MERGE][DO_NOT_REVIEW] PR capi node labels fix test #3378

Closed

Conversation

ykakarap
Contributor

What type of PR is this?

This PR is to test a CAPI patch: kubernetes-sigs/cluster-api#8427

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the following labels on Mar 30, 2023: do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), and size/L (denotes a PR that changes 100-499 lines, ignoring generated files).
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mboersma for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ykakarap
Contributor Author

/test ?

@k8s-ci-robot
Contributor

@ykakarap: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-ci-entrypoint
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-aks
  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-apiversion-upgrade
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-conformance
  • /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-e2e-optional
  • /test pull-cluster-api-provider-azure-e2e-workload-upgrade
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts-serial-slow

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-apiversion-upgrade
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-ci-entrypoint
  • pull-cluster-api-provider-azure-conformance
  • pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-aks
  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-verify
  • pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@ykakarap
Contributor Author

Found a bug in the patch. Fixing it now and re-testing.

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@ykakarap
Contributor Author

The pull-cluster-api-provider-azure-e2e-optional job has been stuck in pending for a long time. I will rebase and force-push to see if that restarts the job.

@ykakarap force-pushed the pr-capi-node-labels-fix-test branch from 487f5dc to a0e2c86 on March 31, 2023 01:05
@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e

1 similar comment
@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

@ykakarap @fabriziopandini the MachinePool job seems to be flakier than usual in this PR. I looked at one of the failures (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/3378/pull-cluster-api-provider-azure-e2e/1641831314409656320) and the flake seems to be related to the uninitialized node taint:

  STEP: creating an HTTP deployment @ 03/31/23 16:16:36.947
  STEP: waiting for deployment default/web3cvttl to be available @ 03/31/23 16:16:37.066
  Mar 31 16:16:37.066: INFO: starting to wait for deployment to become available
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:121 @ 03/31/23 16:31:37.175
  Mar 31 16:31:37.175: INFO: FAILED!

In the pod describe output in the artifacts I see:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  11m (x4 over 26m)  default-scheduler  0/5 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
  Normal   Scheduled         10m                default-scheduler  Successfully assigned default/web3cvttl-8488c67649-5rf86 to capz-e2e-ri4hnd-vmss-mp-0000001
  Normal   Pulling           10m                kubelet            Pulling image "httpd"
  Normal   Pulled            10m                kubelet            Successfully pulled image "httpd" in 8.682210475s (10.39117994s including waiting)
  Normal   Created           10m                kubelet            Created container web3cvttl
  Normal   Started           10m                kubelet            Started container web3cvttl

and

Start Time:       Fri, 31 Mar 2023 16:32:31 +0000

So it seems it took a full 16 minutes for the pod to start, and the test timed out right before that, after 15 minutes. The Warning FailedScheduling 11m (x4 over 26m) part seems to indicate there is a delay scheduling the pod due to the taints, which wasn't there before. Is this expected? Is there something we can do to remove this flake from our tests without increasing the timeout beyond 15 minutes (15 minutes should already be plenty...)? Maybe wait for nodes to be initialized before entering that part of the test suite? In practice, does this mean we expect 15-minute delays before users can schedule pods when a cluster comes up with CAPI 1.4?
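
As an illustration of the "wait for nodes to be initialized" idea, here is a minimal sketch of such a pre-check, assuming a client-go clientset for the workload cluster; the helper name waitForNodesInitialized and the polling interval are made up for this example and are not part of the CAPZ test suite:

```go
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const uninitializedTaintKey = "node.cluster.x-k8s.io/uninitialized"

// waitForNodesInitialized polls the workload cluster until no node carries the
// CAPI "uninitialized" taint, so the deployment step that follows is not held
// up by untolerated taints. Hypothetical helper, for illustration only.
func waitForNodesInitialized(ctx context.Context, c kubernetes.Interface, timeout time.Duration) error {
	return wait.PollImmediateWithContext(ctx, 10*time.Second, timeout, func(ctx context.Context) (bool, error) {
		nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // tolerate transient list errors and retry
		}
		for _, node := range nodes.Items {
			for _, taint := range node.Spec.Taints {
				if taint.Key == uninitializedTaintKey {
					return false, nil
				}
			}
		}
		return true, nil
	})
}
```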

@ykakarap
Contributor Author

ykakarap commented Mar 31, 2023

The Machine should drop the taint from the Node almost immediately after the infra provider reports .spec.providerID.
Could it be that the scheduler goes into some exponential back-off when it cannot find a "good" node, and is not re-evaluating nodes soon enough after the taint is dropped?

Maybe wait for nodes to be initialized before entering that part of the test suite?

Sounds like a good idea. In that case the pods should be scheduled almost immediately, as the taint should already be dropped by that point.
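
For context, the expected flow described above boils down to something like the following. This is a simplified sketch using client-go rather than the actual CAPI Machine controller code, and the function name is hypothetical:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const uninitializedTaintKey = "node.cluster.x-k8s.io/uninitialized"

// removeUninitializedTaint drops the "uninitialized" taint from a node, which
// is conceptually what happens once the infra provider has reported
// .spec.providerID. Simplified sketch; the real controller uses its own
// client, patch helpers, and retry logic.
func removeUninitializedTaint(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	var kept []corev1.Taint
	found := false
	for _, taint := range node.Spec.Taints {
		if taint.Key == uninitializedTaintKey {
			found = true
			continue
		}
		kept = append(kept, taint)
	}
	if !found {
		return nil // taint already gone, nothing to do
	}
	node.Spec.Taints = kept
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```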

@CecileRobertMichon
Contributor

@ykakarap I think the issue might be that there's a delay in MachinePools because the MachinePool controller is not watching cluster nodes, unlike the Machine controller.

@ykakarap
Contributor Author

@CecileRobertMichon I think you are right. There are not enough events that trigger the MachinePool reconciler. It only watches for changes on Cluster and MachinePool, whereas the Machine controller is reconciled much more often (it also reconciles when there is a change on the Node).
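
For illustration, wiring a Node watch into the MachinePool controller could look roughly like the sketch below. It assumes controller-runtime v0.15+ builder and handler signatures, a placeholder MachinePoolReconciler type, and a node-to-MachinePool mapping left unimplemented; the actual change (and the fact that the nodes live in the workload cluster rather than the management cluster) would look different:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// MachinePoolReconciler is a placeholder reconciler type for this sketch.
type MachinePoolReconciler struct {
	client.Client
}

func (r *MachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Reconciliation logic elided.
	return ctrl.Result{}, nil
}

// SetupWithManager registers a watch on Nodes in addition to MachinePools, so
// that node changes (such as a dropped taint) also trigger reconciliation.
func (r *MachinePoolReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&expv1.MachinePool{}).
		Watches(
			&corev1.Node{},
			handler.EnqueueRequestsFromMapFunc(r.nodeToMachinePools),
		).
		Complete(r)
}

// nodeToMachinePools would map a Node event to the MachinePool(s) that own it,
// e.g. by provider ID or labels. Left unimplemented in this sketch.
func (r *MachinePoolReconciler) nodeToMachinePools(ctx context.Context, o client.Object) []reconcile.Request {
	return nil
}
```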

I was not able to force any delay in dropping the taints when using Machines (tested locally), even when I delayed .spec.providerID from being set. So I think it is not an exponential back-off problem.

Can we move the "slow MachinePool reconciliation" into a separate issue and get the current fix for syncing node labels merged?

@CecileRobertMichon
Contributor

CecileRobertMichon commented Mar 31, 2023

Can we move the "slow MachinePool reconciliation" into a separate issue and get the current fix for syncing node labels merged?

yes, I'm working on a fix

edit: opened kubernetes-sigs/cluster-api#8442 to track

@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 1/1

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

1 similar comment
@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 3, 2023

@ykakarap: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-provider-azure-e2e-optional | a0e2c86 | link | false | /test pull-cluster-api-provider-azure-e2e-optional
pull-cluster-api-provider-azure-test | b803720 | link | true | /test pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts | b803720 | link | false | /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 1/1

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 2/2

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 3/3

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 4/4

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

5/5 passes 🎉

@CecileRobertMichon
Contributor

Test with both node watcher + node patching fix

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 1/1

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 2/2

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 3/3

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
