
Add watch on cluster.status.infrastructureReady to reconcile vsphereMachine when cluster infrastructure is ready #1212

Merged
merged 1 commit into kubernetes-sigs:master on Nov 30, 2021

Conversation

mrajashree
Contributor

@mrajashree mrajashree commented Jul 13, 2021

What this PR does / why we need it:
After clusterctl move for self-managed CAPV clusters, worker machines take around 10-15 minutes to go to "Running" phase.
It could be due to the following reasons:

  1. During clusterctl move, the Status fields are removed, so none of the CAPI objects retain their status fields, including the Cluster object. Once the Cluster.Spec.Paused field is unset after the move, controllers begin to reconcile and the status fields are set again.
  2. While reconciling a VSphereMachine object, the CAPV vsphereMachine controller checks the Cluster.Status.InfrastructureReady field before proceeding. This field is false immediately after the move, until the cluster controller has reconciled and updated it.
  3. The vsphereMachine controller gets the Cluster object at the start of every Reconcile, and it is possible that the CAPI cluster controller updates the Cluster object after the vsphereMachine controller has already picked it up. It is also possible that CAPV is hitting this bug where a controller-runtime GET reads stale data; per the comments on that bug, the recommended approach is to keep retrying.
  4. CAPV eventually resyncs with the CAPI objects after the sync period, which defaults to 10 minutes. As a result, after roughly 10 minutes the vsphereMachine controller gets the latest copy of the Cluster object and reconciles the VSphereMachine successfully, because by then Cluster.Status.InfrastructureReady has been set to true.
  5. The solution is therefore either reducing the sync period so that CAPV gets the latest Cluster object sooner, or requeuing the VSphereMachine until Cluster.Status.InfrastructureReady is true. Per these comments in the controller-runtime code, requeuing the object is preferred over reducing the sync period.
  6. This PR modifies the watch on the Cluster object to receive update events when the cluster.status.infrastructureReady field becomes true. This way it enqueues the VSphereMachine for reconciliation once the cluster infrastructure is ready after clusterctl move.
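The transition check described in point 6 boils down to a simple predicate on update events. The sketch below is a simplified, hypothetical stand-in (the `ClusterSnapshot` type and function names are invented for illustration; the real code operates on CAPI `Cluster` objects inside a controller-runtime update predicate):

```go
package main

import "fmt"

// ClusterSnapshot is a hypothetical, trimmed-down stand-in for the CAPI
// Cluster object, carrying only the fields this predicate cares about.
type ClusterSnapshot struct {
	Paused              bool // mirrors Cluster.Spec.Paused
	InfrastructureReady bool // mirrors Cluster.Status.InfrastructureReady
}

// shouldEnqueueOnUpdate reports whether an update event on the Cluster
// should enqueue the dependent VSphereMachines: only a false-to-true
// transition of InfrastructureReady makes another reconcile useful.
func shouldEnqueueOnUpdate(old, new ClusterSnapshot) bool {
	return !old.InfrastructureReady && new.InfrastructureReady
}

func main() {
	before := ClusterSnapshot{InfrastructureReady: false}
	after := ClusterSnapshot{InfrastructureReady: true}
	fmt.Println(shouldEnqueueOnUpdate(before, after)) // transition: enqueue
	fmt.Println(shouldEnqueueOnUpdate(after, after))  // no transition: skip
}
```

Filtering on the transition, rather than on the current value, avoids re-enqueuing every machine on each unrelated Cluster update.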

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1211

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 13, 2021
@k8s-ci-robot
Contributor

Welcome @mrajashree!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-vsphere 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-vsphere has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @mrajashree. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 13, 2021
@neolit123
Member

deferring to the CAPV maintainers:
/assign yastij randomvariable

@neolit123
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 14, 2021
```diff
 conditions.MarkFalse(ctx.VSphereMachine, infrav1.VMProvisionedCondition, infrav1.WaitingForClusterInfrastructureReason, clusterv1.ConditionSeverityInfo, "")
-return reconcile.Result{}, nil
+return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
```
Member


Should we have a watch here instead of requeueing?

Contributor Author


@vincepri oh you mean just like there's a watch on Cluster for the VSphereCluster object, we should add one mapping Cluster to VSphereMachine?

Member


Yeah, if we're waiting for it, we should probably add a watch instead of using RequeueAfter.

Contributor Author


Oh, looks like there already is a watch on Cluster; I just need to update the predicate to check for this.

Contributor Author


I tested changing the update predicate for the watch on Cluster, but I see the same behavior: machines reconcile only after a while. I'll see if I missed something and test again. Only requeuing seems to work.

Contributor Author


The watch on the Cluster object's infrastructureReady field change wasn't working because it wasn't enqueuing the VSphereMachine object; opened PR #1306 for that.
With the change in the clusterToVsphereMachine mapper, updating the watch on the Cluster object to enqueue the VSphereMachine when cluster.status.infrastructureReady becomes true works. I'll rebase this PR once the other one is merged.
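The mapper change referenced above can be illustrated with a simplified sketch. Everything here is a hypothetical stand-in (the `Request` type, the in-memory `machinesByCluster` index, and the function name are invented for illustration; the real mapper is a controller-runtime `handler.MapFunc` that lists VSphereMachines by their cluster label):

```go
package main

import "fmt"

// Request is a hypothetical stand-in for a reconcile.Request: the
// namespaced name of a VSphereMachine to enqueue.
type Request struct {
	Namespace, Name string
}

// machinesByCluster simulates looking up the VSphereMachines that belong
// to a cluster (the real controller lists them by a cluster-name label).
var machinesByCluster = map[string][]string{
	"mgmt-cluster": {"mgmt-cluster-md-0-abc", "mgmt-cluster-md-0-def"},
}

// clusterToVSphereMachines maps a Cluster event to reconcile requests for
// every VSphereMachine in that cluster. Without this mapping, a watch on
// Cluster produces events that no machine reconciler ever sees, which is
// why the predicate change alone had no effect.
func clusterToVSphereMachines(namespace, clusterName string) []Request {
	var out []Request
	for _, m := range machinesByCluster[clusterName] {
		out = append(out, Request{Namespace: namespace, Name: m})
	}
	return out
}

func main() {
	for _, r := range clusterToVSphereMachines("default", "mgmt-cluster") {
		fmt.Printf("%s/%s\n", r.Namespace, r.Name)
	}
}
```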

@srm09
Contributor

srm09 commented Jul 19, 2021

Retriggering, seems to be an unrelated failure

@srm09
Contributor

srm09 commented Jul 19, 2021

/test pull-cluster-api-provider-vsphere-e2e

@mrajashree mrajashree changed the title Requeue vsphereMachine if cluster infrastructure is not ready [WIP] Requeue vsphereMachine if cluster infrastructure is not ready Jul 19, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 19, 2021
@yastij
Member

yastij commented Jul 22, 2021

/assign @srm09

@yastij
Member

yastij commented Jul 26, 2021

/retest

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2021
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 13, 2021
@mrajashree mrajashree changed the base branch from release-0.7 to master November 13, 2021 00:40
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 13, 2021
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 13, 2021
@mrajashree mrajashree changed the title [WIP] Requeue vsphereMachine if cluster infrastructure is not ready Requeue vsphereMachine if cluster infrastructure is not ready Nov 13, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 13, 2021
@k8s-ci-robot
Contributor

k8s-ci-robot commented Nov 13, 2021

@mrajashree: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-vsphere-verify-shell | a4156ff | link | true | /test pull-cluster-api-provider-vsphere-verify-shell |
| pull-cluster-api-provider-vsphere-verify-markdown | a4156ff | link | true | /test pull-cluster-api-provider-vsphere-verify-markdown |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mrajashree mrajashree changed the title Requeue vsphereMachine if cluster infrastructure is not ready Add watch on cluster.status.infrastructureReady to reconcile vsphereMachine when cluster infrastructure is ready Nov 15, 2021
@mrajashree
Contributor Author

Updated the PR to replace the predicates on the Cluster watch with ClusterUnpausedAndInfrastructureReady.
This covers create events when Cluster.Spec.Paused is false and Cluster.Status.InfrastructureReady is true,
and update events when either Cluster.Spec.Paused transitions to false or Cluster.Status.InfrastructureReady transitions to true.
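Taken together, the behavior described above amounts to the boolean logic below. This is a hedged sketch with invented, minimal types (the real ClusterUnpausedAndInfrastructureReady predicate comes from Cluster API's predicates package and operates on controller-runtime create/update events):

```go
package main

import "fmt"

// Cluster is a hypothetical, minimal stand-in for the CAPI Cluster type.
type Cluster struct {
	Paused              bool // mirrors Spec.Paused
	InfrastructureReady bool // mirrors Status.InfrastructureReady
}

// acceptCreate mirrors the create-event side: process the event only when
// the cluster is unpaused and its infrastructure is already ready.
func acceptCreate(c Cluster) bool {
	return !c.Paused && c.InfrastructureReady
}

// acceptUpdate mirrors the update-event side: process the event when the
// cluster transitions to unpaused, or when infrastructureReady flips to true.
func acceptUpdate(old, new Cluster) bool {
	unpaused := old.Paused && !new.Paused
	becameReady := !old.InfrastructureReady && new.InfrastructureReady
	return unpaused || becameReady
}

func main() {
	fmt.Println(acceptCreate(Cluster{Paused: false, InfrastructureReady: true}))
	fmt.Println(acceptUpdate(Cluster{Paused: true}, Cluster{Paused: false}))
}
```

Combining both transitions in one predicate covers the post-move sequence: the cluster is first unpaused, then its infrastructure becomes ready, and either event re-enqueues the machines.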

@srm09
Contributor

srm09 commented Nov 15, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 15, 2021
@mrajashree
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 26, 2021
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 26, 2021
@srm09
Contributor

srm09 commented Nov 30, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 30, 2021
@srm09
Contributor

srm09 commented Nov 30, 2021

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: srm09

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2021
@k8s-ci-robot k8s-ci-robot merged commit 737850c into kubernetes-sigs:master Nov 30, 2021
Successfully merging this pull request may close these issues.

After clusterctl move, vsphere worker machines take ~10-12 minutes to go to "Running" phase
8 participants