
Add watch on cluster.status.infrastructureReady to reconcile vsphereMachine when cluster infrastructure is ready #1212

Merged
merged 1 commit into kubernetes-sigs:master on Nov 30, 2021

Conversation

mrajashree
Contributor

@mrajashree mrajashree commented Jul 13, 2021

What this PR does / why we need it:
After clusterctl move for self-managed CAPV clusters, worker machines take around 10-15 minutes to go to "Running" phase.
It could be due to the following reasons:

  1. During clusterctl move, the Status fields are removed, so none of the CAPI objects retain their status fields, including the Cluster object. Once the Cluster.Spec.Paused field is unset after the move, controllers begin to reconcile and the status fields are set again.
  2. While reconciling a VSphereMachine object, the CAPV vsphereMachine controller checks the Cluster.Status.InfrastructureReady field before proceeding. This field is false immediately after the move, until the cluster controller has reconciled and updated it.
  3. The vsphereMachine controller gets the Cluster object at the start of every Reconcile, and it is possible that the CAPI cluster controller updates the Cluster object after the vsphereMachine controller has already picked it up. It is also possible that CAPV is hitting this bug where a controller-runtime GET reads stale data; per the comments on that bug, the recommended approach is to keep retrying.
  4. CAPV eventually resyncs with the CAPI objects after the sync period, which defaults to 10 minutes. As a result, after roughly 10 minutes the vsphereMachine controller gets the latest copy of the Cluster object and reconciles the VSphereMachine successfully, because by then Cluster.Status.InfrastructureReady has been set to true.
  5. The solution is therefore either reducing the sync period so that CAPV gets the latest Cluster object sooner, or requeuing the VSphereMachine until Cluster.Status.InfrastructureReady is true. Per these comments in the controller-runtime code, requeuing the object is preferred over reducing the sync period.
  6. This PR modifies the watch on the Cluster object to receive update events when the cluster.status.infrastructureReady field becomes true. This way it enqueues the VSphereMachine for reconciliation once the cluster infrastructure is ready after clusterctl move.
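The transition check described in point 6 boils down to a simple predicate on update events. The sketch below is a simplified, hypothetical stand-in (the `ClusterSnapshot` type and function names are invented for illustration; the real code operates on CAPI `Cluster` objects inside a controller-runtime update predicate):

```go
package main

import "fmt"

// ClusterSnapshot is a hypothetical, trimmed-down stand-in for the CAPI
// Cluster object, carrying only the fields this predicate cares about.
type ClusterSnapshot struct {
	Paused              bool // mirrors Cluster.Spec.Paused
	InfrastructureReady bool // mirrors Cluster.Status.InfrastructureReady
}

// shouldEnqueueOnUpdate reports whether an update event on the Cluster
// should enqueue the dependent VSphereMachines: only a false-to-true
// transition of InfrastructureReady makes another reconcile useful.
func shouldEnqueueOnUpdate(old, new ClusterSnapshot) bool {
	return !old.InfrastructureReady && new.InfrastructureReady
}

func main() {
	before := ClusterSnapshot{InfrastructureReady: false}
	after := ClusterSnapshot{InfrastructureReady: true}
	fmt.Println(shouldEnqueueOnUpdate(before, after)) // transition: enqueue
	fmt.Println(shouldEnqueueOnUpdate(after, after))  // no transition: skip
}
```

Filtering on the transition, rather than on the current value, avoids re-enqueuing every machine on each unrelated Cluster update.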

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1211

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 13, 2021
@k8s-ci-robot
Contributor

Welcome @mrajashree!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-vsphere 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-vsphere has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @mrajashree. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 13, 2021
@neolit123
Member

deferring to the CAPV maintainers:
/assign yastij randomvariable

@neolit123
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 14, 2021
```diff
 conditions.MarkFalse(ctx.VSphereMachine, infrav1.VMProvisionedCondition, infrav1.WaitingForClusterInfrastructureReason, clusterv1.ConditionSeverityInfo, "")
-return reconcile.Result{}, nil
+return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
```
Member


Should we have a watch here instead of requeueing?

Contributor Author


@vincepri oh you mean just like there's a watch on Cluster for the VSphereCluster object, we should add one mapping Cluster to VSphereMachine?

Member


Yeah, if we're waiting for it, we should probably add a watch instead of using RequeueAfter.

Contributor Author


Oh, looks like there already is a watch on Cluster; I just need to update the predicate to check for this.

Contributor Author


I tested changing the update predicate for the watch on Cluster, but I see the same behavior: machines reconcile only after a while. I'll see if I missed something and test again. Only requeuing seems to work.

Contributor Author


The watch on the Cluster object's infrastructureReady field change wasn't working because it wasn't enqueuing the VSphereMachine object; opened PR #1306 for that.
With the change in the clusterToVsphereMachine mapper, updating the watch on the Cluster object to enqueue the VSphereMachine when cluster.status.infrastructureReady becomes true works. I'll rebase this PR once the other one is merged.
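The mapper change referenced above can be illustrated with a simplified sketch. Everything here is a hypothetical stand-in (the `Request` type, the in-memory `machinesByCluster` index, and the function name are invented for illustration; the real mapper is a controller-runtime `handler.MapFunc` that lists VSphereMachines by their cluster label):

```go
package main

import "fmt"

// Request is a hypothetical stand-in for a reconcile.Request: the
// namespaced name of a VSphereMachine to enqueue.
type Request struct {
	Namespace, Name string
}

// machinesByCluster simulates looking up the VSphereMachines that belong
// to a cluster (the real controller lists them by a cluster-name label).
var machinesByCluster = map[string][]string{
	"mgmt-cluster": {"mgmt-cluster-md-0-abc", "mgmt-cluster-md-0-def"},
}

// clusterToVSphereMachines maps a Cluster event to reconcile requests for
// every VSphereMachine in that cluster. Without this mapping, a watch on
// Cluster produces events that no machine reconciler ever sees, which is
// why the predicate change alone had no effect.
func clusterToVSphereMachines(namespace, clusterName string) []Request {
	var out []Request
	for _, m := range machinesByCluster[clusterName] {
		out = append(out, Request{Namespace: namespace, Name: m})
	}
	return out
}

func main() {
	for _, r := range clusterToVSphereMachines("default", "mgmt-cluster") {
		fmt.Printf("%s/%s\n", r.Namespace, r.Name)
	}
}
```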

@srm09
Contributor

srm09 commented Jul 19, 2021

Retriggering, seems to be an unrelated failure

@srm09
Contributor

srm09 commented Jul 19, 2021

/test pull-cluster-api-provider-vsphere-e2e

@mrajashree mrajashree changed the title Requeue vsphereMachine if cluster infrastructure is not ready [WIP] Requeue vsphereMachine if cluster infrastructure is not ready Jul 19, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2021
@yastij
Member

yastij commented Jul 22, 2021

/assign @srm09

@yastij
Member

yastij commented Jul 26, 2021

/retest

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2021
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 13, 2021
@mrajashree mrajashree changed the base branch from release-0.7 to master November 13, 2021 00:40
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 13, 2021
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 13, 2021
@mrajashree mrajashree changed the title [WIP] Requeue vsphereMachine if cluster infrastructure is not ready Requeue vsphereMachine if cluster infrastructure is not ready Nov 13, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 13, 2021
@k8s-ci-robot
Contributor

k8s-ci-robot commented Nov 13, 2021

@mrajashree: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-vsphere-verify-shell | a4156ff | link | true | /test pull-cluster-api-provider-vsphere-verify-shell |
| pull-cluster-api-provider-vsphere-verify-markdown | a4156ff | link | true | /test pull-cluster-api-provider-vsphere-verify-markdown |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mrajashree mrajashree changed the title Requeue vsphereMachine if cluster infrastructure is not ready Add watch on cluster.status.infrastructureReady to reconcile vsphereMachine when cluster infrastructure is ready Nov 15, 2021
@mrajashree
Contributor Author

Updated the PR to replace the predicates on the Cluster watch with ClusterUnpausedAndInfrastructureReady.
This covers create events when Cluster.Spec.Paused is false and Cluster.Status.InfrastructureReady is true,
and update events when either Cluster.Spec.Paused transitions to false or Cluster.Status.InfrastructureReady transitions to true.
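Taken together, the behavior described above amounts to the boolean logic below. This is a hedged sketch with invented, minimal types (the real ClusterUnpausedAndInfrastructureReady predicate comes from Cluster API's predicates package and operates on controller-runtime create/update events):

```go
package main

import "fmt"

// Cluster is a hypothetical, minimal stand-in for the CAPI Cluster type.
type Cluster struct {
	Paused              bool // mirrors Spec.Paused
	InfrastructureReady bool // mirrors Status.InfrastructureReady
}

// acceptCreate mirrors the create-event side: process the event only when
// the cluster is unpaused and its infrastructure is already ready.
func acceptCreate(c Cluster) bool {
	return !c.Paused && c.InfrastructureReady
}

// acceptUpdate mirrors the update-event side: process the event when the
// cluster transitions to unpaused, or when infrastructureReady flips to true.
func acceptUpdate(old, new Cluster) bool {
	unpaused := old.Paused && !new.Paused
	becameReady := !old.InfrastructureReady && new.InfrastructureReady
	return unpaused || becameReady
}

func main() {
	fmt.Println(acceptCreate(Cluster{Paused: false, InfrastructureReady: true}))
	fmt.Println(acceptUpdate(Cluster{Paused: true}, Cluster{Paused: false}))
}
```

Combining both transitions in one predicate covers the post-move sequence: the cluster is first unpaused, then its infrastructure becomes ready, and either event re-enqueues the machines.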

@srm09
Contributor

srm09 commented Nov 15, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 15, 2021
@mrajashree
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 26, 2021
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 26, 2021
@srm09
Contributor

srm09 commented Nov 30, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 30, 2021
@srm09
Contributor

srm09 commented Nov 30, 2021

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: srm09

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2021
@k8s-ci-robot k8s-ci-robot merged commit 737850c into kubernetes-sigs:master Nov 30, 2021
Successfully merging this pull request may close these issues.

After clusterctl move, vsphere worker machines take ~10-12 minutes to go to "Running" phase
8 participants