
[DO_NOT_MERGE][DO_NOT_REVIEW] PR capi node labels fix test #3378

Closed

Conversation

ykakarap
Contributor

What type of PR is this?

This PR is to test a CAPI patch: kubernetes-sigs/cluster-api#8427

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the following labels on Mar 30, 2023: do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), and size/L (denotes a PR that changes 100-499 lines, ignoring generated files).
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mboersma for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ykakarap
Contributor Author

/test ?

@k8s-ci-robot
Contributor

@ykakarap: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-ci-entrypoint
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-aks
  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-apiversion-upgrade
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-conformance
  • /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-e2e-optional
  • /test pull-cluster-api-provider-azure-e2e-workload-upgrade
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts-serial-slow

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-apiversion-upgrade
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-ci-entrypoint
  • pull-cluster-api-provider-azure-conformance
  • pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-aks
  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-verify
  • pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@ykakarap
Contributor Author

Found a bug in the patch. Fixing it now and re-testing.

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@ykakarap
Contributor Author

The pull-cluster-api-provider-azure-e2e-optional job has been stuck in pending for a long time. I will rebase and force-push to see if that restarts the job.

@ykakarap force-pushed the pr-capi-node-labels-fix-test branch from 487f5dc to a0e2c86 on March 31, 2023 01:05
@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e

1 similar comment
@ykakarap
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

@ykakarap @fabriziopandini the MachinePool job seems to be flakier than usual in this PR. I looked at one of the failures (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/3378/pull-cluster-api-provider-azure-e2e/1641831314409656320) and the flake seems to be related to the uninitialized node taint:

  STEP: creating an HTTP deployment @ 03/31/23 16:16:36.947
  STEP: waiting for deployment default/web3cvttl to be available @ 03/31/23 16:16:37.066
  Mar 31 16:16:37.066: INFO: starting to wait for deployment to become available
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:121 @ 03/31/23 16:31:37.175
  Mar 31 16:31:37.175: INFO: FAILED!

In the pod describe output in the artifacts I see:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  11m (x4 over 26m)  default-scheduler  0/5 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
  Normal   Scheduled         10m                default-scheduler  Successfully assigned default/web3cvttl-8488c67649-5rf86 to capz-e2e-ri4hnd-vmss-mp-0000001
  Normal   Pulling           10m                kubelet            Pulling image "httpd"
  Normal   Pulled            10m                kubelet            Successfully pulled image "httpd" in 8.682210475s (10.39117994s including waiting)
  Normal   Created           10m                kubelet            Created container web3cvttl
  Normal   Started           10m                kubelet            Started container web3cvttl

and

Start Time:       Fri, 31 Mar 2023 16:32:31 +0000

So it seems it took a full 16 minutes for the pod to start, and the test timed out right before that, after 15 minutes. The Warning FailedScheduling 11m (x4 over 26m) part seems to indicate there is a delay scheduling the pod due to the taints, which wasn't there before. Is this expected? Is there something we can do to remove this flake from our tests without increasing the timeout beyond 15 minutes (15 minutes should already be plenty...)? Maybe wait for nodes to be initialized before entering that part of the test suite? In practice, does this mean we expect 15-minute delays before users can schedule pods when a cluster comes up with CAPI 1.4?
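
As an illustration of the "wait for nodes to be initialized" idea, here is a minimal sketch of such a pre-check, assuming a client-go clientset for the workload cluster; the helper name waitForNodesInitialized and the polling interval are made up for this example and are not part of the CAPZ test suite:

```go
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const uninitializedTaintKey = "node.cluster.x-k8s.io/uninitialized"

// waitForNodesInitialized polls the workload cluster until no node carries the
// CAPI "uninitialized" taint, so the deployment step that follows is not held
// up by untolerated taints. Hypothetical helper, for illustration only.
func waitForNodesInitialized(ctx context.Context, c kubernetes.Interface, timeout time.Duration) error {
	return wait.PollImmediateWithContext(ctx, 10*time.Second, timeout, func(ctx context.Context) (bool, error) {
		nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // tolerate transient list errors and retry
		}
		for _, node := range nodes.Items {
			for _, taint := range node.Spec.Taints {
				if taint.Key == uninitializedTaintKey {
					return false, nil
				}
			}
		}
		return true, nil
	})
}
```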

@ykakarap
Contributor Author

ykakarap commented Mar 31, 2023

The Machine should drop the taint from the Node almost immediately after the infra provider reports .spec.providerID.
Could it be that the scheduler goes into some exponential back-off when it cannot find a "good" node, and is not re-evaluating nodes soon enough after the taint is dropped?

Maybe wait for nodes to be initialized before entering that part of the test suite?

Sounds like a good idea. In that case the pods should be scheduled almost immediately, as the taint should already be dropped by that point.
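
For context, the expected flow described above boils down to something like the following. This is a simplified sketch using client-go rather than the actual CAPI Machine controller code, and the function name is hypothetical:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const uninitializedTaintKey = "node.cluster.x-k8s.io/uninitialized"

// removeUninitializedTaint drops the "uninitialized" taint from a node, which
// is conceptually what happens once the infra provider has reported
// .spec.providerID. Simplified sketch; the real controller uses its own
// client, patch helpers, and retry logic.
func removeUninitializedTaint(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	var kept []corev1.Taint
	found := false
	for _, taint := range node.Spec.Taints {
		if taint.Key == uninitializedTaintKey {
			found = true
			continue
		}
		kept = append(kept, taint)
	}
	if !found {
		return nil // taint already gone, nothing to do
	}
	node.Spec.Taints = kept
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```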

@CecileRobertMichon
Contributor

@ykakarap I think the issue might be that there's a delay in MachinePools because the MachinePool controller is not watching cluster nodes, unlike the Machine controller.

@ykakarap
Contributor Author

@CecileRobertMichon I think you are right. There are not enough events that trigger the MachinePool reconciler. It only watches for changes on Cluster and MachinePool, whereas the Machine controller is reconciled much more often (it also reconciles when there is a change on the Node).
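
For illustration, wiring a Node watch into the MachinePool controller could look roughly like the sketch below. It assumes controller-runtime v0.15+ builder and handler signatures, a placeholder MachinePoolReconciler type, and a node-to-MachinePool mapping left unimplemented; the actual change (and the fact that the nodes live in the workload cluster rather than the management cluster) would look different:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// MachinePoolReconciler is a placeholder reconciler type for this sketch.
type MachinePoolReconciler struct {
	client.Client
}

func (r *MachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Reconciliation logic elided.
	return ctrl.Result{}, nil
}

// SetupWithManager registers a watch on Nodes in addition to MachinePools, so
// that node changes (such as a dropped taint) also trigger reconciliation.
func (r *MachinePoolReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&expv1.MachinePool{}).
		Watches(
			&corev1.Node{},
			handler.EnqueueRequestsFromMapFunc(r.nodeToMachinePools),
		).
		Complete(r)
}

// nodeToMachinePools would map a Node event to the MachinePool(s) that own it,
// e.g. by provider ID or labels. Left unimplemented in this sketch.
func (r *MachinePoolReconciler) nodeToMachinePools(ctx context.Context, o client.Object) []reconcile.Request {
	return nil
}
```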

I was not able to force any delay in dropping the taints when using Machines (tested locally), even when I delayed .spec.providerID from being set. So I think it is not an exponential back-off problem.

Can we move the "slow MachinePool reconciliation" into a separate issue and get the current fix for syncing node labels merged?

@CecileRobertMichon
Contributor

CecileRobertMichon commented Mar 31, 2023

Can we move the "slow MachinePool reconciliation" into a separate issue and get the current fix for syncing node labels merged?

yes, I'm working on a fix

edit: opened kubernetes-sigs/cluster-api#8442 to track

@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 1/1

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

1 similar comment
@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 3, 2023

@ykakarap: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-provider-azure-e2e-optional | a0e2c86 | link | false | /test pull-cluster-api-provider-azure-e2e-optional
pull-cluster-api-provider-azure-test | b803720 | link | true | /test pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts | b803720 | link | false | /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@CecileRobertMichon
Contributor

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 1/1

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 2/2

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 3/3

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 4/4

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

5/5 passes 🎉

@CecileRobertMichon
Contributor

Test with both node watcher + node patching fix

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 1/1

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 2/2

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

pass 3/3

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Contributor

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
