VMs wouldn't power on when using a storage policy #1601

Closed
johananl opened this issue Aug 19, 2022 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@johananl
Member

johananl commented Aug 19, 2022

/kind bug

What steps did you take and what happened:
I tried running an e2e test using GINKGO_FOCUS="\[PR-Blocking\]" make e2e and the test timed out because the controller VM gets created but stays powered off. Further investigation showed that the CAPV controller never progresses past the following line:

if err := vms.reconcileStoragePolicy(vmCtx); err != nil {

This in turn is caused by the fact that the following call never returns:

entities, err := pbmClient.QueryAssociatedEntity(ctx, pbmTypes.PbmProfileId{UniqueId: storageProfileID}, "virtualDiskId")

What did you expect to happen:
I expected the VM to power on and the test to proceed.

Anything else you would like to add:
I was able to isolate the bug and it's either a problem in govmomi or a problem on the vSphere server side. I've opened an upstream issue: vmware/govmomi#2929

Workaround: Setting the VSPHERE_STORAGE_POLICY variable to an empty string makes the test converge.

The bug also seems to occur when creating clusters manually, i.e. it isn't e2e-specific.

Environment:

  • Cluster-api-provider-vsphere version: main at d2494c3
  • Kubernetes version: (use kubectl version): n/a
  • OS (e.g. from /etc/os-release): n/a
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2022
@johananl johananl changed the title VMs wouldn't power on when using a storage policy VMs wouldn't power on in e2e test when using a storage policy Aug 19, 2022
@johananl johananl changed the title VMs wouldn't power on in e2e test when using a storage policy VMs wouldn't power on when using a storage policy Aug 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2022
@johananl
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2023
@johananl
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2023
@srm09
Contributor

srm09 commented Feb 16, 2023

Since there has been no movement on the govmomi issue that's linked above, this would save you from re-adding the label over and over.
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 16, 2023
@chrischdi
Member

chrischdi commented Aug 17, 2023

Is the root cause here that the config file in use (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/test/e2e/config/vsphere-dev.yaml#L153 or https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/test/e2e/config/vsphere-ci.yaml#L147) referred to a storage policy that did not exist in the environment?

So there would be two issues:

  • Document better how to run the E2E tests on a custom environment (especially about the storage policy variable)
  • Check how a storage policy that does not exist is handled and try to reproduce the issue that way; maybe the timeouts could be improved here.

@johananl
Member Author

I'd say the root cause is the upstream issue vmware/govmomi#2929, but still I opened this issue here because the upstream issue affects CAPV developers and it isn't obvious right away what is causing the problem.

Both of your suggested points make sense to me @chrischdi.

@sbueringer
Member

sbueringer commented Oct 31, 2023

@johananl Wondering if this PR potentially solves/changes the issue: #2467

During the debug of #2453 (comment) we have figured out this call is too expensive for environments with a lot of resources, like more than 10k resources.
This may turn into a huge slow query that can take up to 20 minutes.

@johananl
Member Author

johananl commented Nov 2, 2023

@sbueringer the govmomi bug was never addressed and at some point I got tired of doing remove-lifecycle stale...

Looking at #2467, we no longer call the problematic govmomi function. However, since we don't know the root cause, I can't be certain the problem is gone.

Unfortunately, I don't have access to a vSphere environment on which I can try to reproduce this right now. As far as I'm concerned we can close this, especially given that there seems to be no traction either here or on the govmomi issue.

@sbueringer
Member

sbueringer commented Nov 13, 2023

Hm yeah. Without being able to reproduce it, it's hard to tell what the problem is.

Was your environment similar to the one described on #2467?

"we have figured out this call is too expensive for environments with a lot of resources, like more than 10k resources."

@johananl
Member Author

johananl commented Nov 13, 2023

Maybe, I'm not sure what "resource" means in this context. I also didn't have full visibility over the environment since it's managed by my employer for multiple purposes and therefore I didn't have admin privileges.

@sbueringer
Member

sbueringer commented Nov 14, 2023

Alright, thx! Let's close this issue for now. Please re-open it in case you (or anyone else seeing this issue in the future) observe this behavior with a CAPV version that includes #2467.

@sbueringer
Member

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

/close


@johananl
Member Author

SGTM, thanks for consistently being a very reliable maintainer @sbueringer 🙏

@sbueringer
Member

Thank you! :)

6 participants