VMs wouldn't power on when using a storage policy #1601

johananl · 2022-08-19T09:50:17Z

/kind bug

What steps did you take and what happened:
I tried running an e2e test using GINKGO_FOCUS="\[PR-Blocking\]" make e2e and the test timed out because the controller VM gets created but stays powered off. Further investigation showed that the CAPV controller never progresses past the following line:

cluster-api-provider-vsphere/pkg/services/govmomi/service.go

Line 131 in a7b6edc

if err := vms.reconcileStoragePolicy(vmCtx); err != nil {

This in turn is caused by the fact that the following call never returns:

cluster-api-provider-vsphere/pkg/services/govmomi/service.go

Line 322 in a7b6edc

    
entities, err := pbmClient.QueryAssociatedEntity(ctx, pbmTypes.PbmProfileId{UniqueId: storageProfileID}, "virtualDiskId")

What did you expect to happen:
I expected the VM to power on and the test to proceed.

Anything else you would like to add:
I was able to isolate the bug and it's either a problem in govmomi or a problem on the vSphere server side. I've opened an upstream issue: vmware/govmomi#2929

Workaround: Setting the VSPHERE_STORAGE_POLICY variable to an empty string makes the test converge.

The bug also seems to occur when creating clusters manually, i.e. it isn't specific to the e2e tests.

Environment:

Cluster-api-provider-vsphere version: main at d2494c3
Kubernetes version: (use kubectl version): n/a
OS (e.g. from /etc/os-release): n/a


k8s-triage-robot · 2022-11-17T10:58:28Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

johananl · 2022-11-17T11:25:32Z

/remove-lifecycle stale

k8s-triage-robot · 2023-02-15T12:11:38Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

johananl · 2023-02-15T14:53:54Z

/remove-lifecycle stale

srm09 · 2023-02-16T18:42:43Z

Since there has been no movement on the govmomi issue that's linked above, this would save you from re-adding the label over and over.
/lifecycle frozen

chrischdi · 2023-08-17T17:47:30Z

Is the root cause here that the config file used (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/test/e2e/config/vsphere-dev.yaml#L153 or https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/test/e2e/config/vsphere-ci.yaml#L147) referred to a storage policy which did not exist in the environment?

So there would be two issues:

Better document how to run the E2E tests on a custom environment (especially the storage policy variable).
Check the handling of a storage policy which does not exist and try to reproduce the issue that way; maybe the timeouts could be improved here.

johananl · 2023-08-21T09:11:36Z

I'd say the root cause is the upstream issue vmware/govmomi#2929, but still I opened this issue here because the upstream issue affects CAPV developers and it isn't obvious right away what is causing the problem.

Both of your suggested points make sense to me @chrischdi.

sbueringer · 2023-10-31T11:18:12Z

@johananl Wondering if this PR potentially solves/changes the issue: #2467

During the debug of #2453 (comment) we have figured out this call is too expensive for environments with a lot of resources, like more than 10k resources.
This may turn into a huge slow query that can take up to 20 minutes.

johananl · 2023-11-02T11:12:31Z

@sbueringer the govmomi bug was never addressed and at some point I got tired of doing remove-lifecycle stale...

Looking at #2467, we no longer call the problematic govmomi function. However, since we don't know the root cause, I can't be certain the problem is gone.

Unfortunately, I don't have access to a vSphere environment on which I can try to reproduce this right now. As far as I'm concerned we can close this, especially given that there seems to be no traction here nor on the govmomi issue.

sbueringer · 2023-11-13T14:57:34Z

Hm yeah. Without being able to reproduce it, it's hard to tell what the problem is.

Was your environment similar to the one described in #2467?

we have figured out this call is too expensive for environments with a lot of resources, like more than 10k resources.

johananl · 2023-11-13T16:15:23Z

Maybe, I'm not sure what "resource" means in this context. I also didn't have full visibility over the environment since it's managed by my employer for multiple purposes and therefore I didn't have admin privileges.

sbueringer · 2023-11-14T10:36:51Z

Alright, thx! Let's close this issue for now. Please re-open in case you (or anyone else seeing this issue in the future) observe this behavior with a CAPV version that includes #2467.

sbueringer · 2023-11-14T10:37:06Z

/close

k8s-ci-robot · 2023-11-14T10:37:11Z

@sbueringer: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

johananl · 2023-11-14T11:22:54Z

SGTM, thanks for consistently being a very reliable maintainer @sbueringer 🙏

sbueringer · 2023-11-14T11:29:08Z

Thank you! :)

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2022

johananl changed the title ~~VMs wouldn't power on when using a storage policy~~ VMs wouldn't power on in e2e test when using a storage policy Aug 19, 2022

johananl changed the title ~~VMs wouldn't power on in e2e test when using a storage policy~~ VMs wouldn't power on when using a storage policy Aug 19, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2023

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2023

k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 16, 2023

k8s-ci-robot closed this as completed Nov 14, 2023
