The object 'vim.VirtualMachine:vm-988369' has already been deleted or has not been completely created #2495

Closed
dzjiang91 opened this issue Aug 7, 2023 · 5 comments

dzjiang91 commented Aug 7, 2023

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

We recreated nodes one by one, reusing the same node name (deleting a node and creating a new node with the same name but a different VM), without restarting the vSphere CSI controller. Afterwards, PVC creation failed because listing volumes failed:

2023-08-03T21:08:14.583819594Z {"level":"error","time":"2023-08-03T21:08:14.583619278Z","caller":"vanilla/controller_helper.go:190","msg":"failed to get VM managed objects from VM objects, err: ServerFaultCode: The object 'vim.VirtualMachine:vm-988354' has already been deleted or has not been completely created","TraceId":"aaaa245f-716f-4044-88c3-9505a183f368","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.getBlockVolumeIDToNodeUUIDMap\n\t/build/pkg/csi/service/vanilla/controller_helper.go:190\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).processQueryResultsListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2582\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2542\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2556\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}

What you expected to happen:
PVC requests should not fail after node recreation.

How to reproduce it (as minimally and precisely as possible):
IIUC, this failure is caused by a race condition, and I don't have a reliable way to reproduce it. It may be reproducible by recreating a node with the same name and then attempting PVC creation.

Anything else we need to know?:

From code reading and logs: the vSphere CSI controller watches CSINode objects and maintains two maps, 'nodeNameToUUID' (node name to provider ID) and 'nodeVMs' (provider ID to VM information). When a CSINode is deleted, the controller removes the node name and its provider ID from both maps. During the machine recreation, however, the controller never observed a CSINode removal; it only observed the CSINode being updated with a new provider ID. In that case it updates 'nodeNameToUUID' and 'nodeVMs' with the new provider ID without deleting the old machine's entry from 'nodeVMs'.

During PVC creation, the vSphere controller tried to list the attached volumes of every machine in 'nodeVMs'. Because 'nodeVMs' still contained a machine that had already been removed, the list operation failed, which is exactly what we observed.
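
To make the sequence concrete, here is a minimal Go sketch of the cache behavior described above. The map names come from this issue; the nodeManager type, the registerNode method, and the new VM ref vm-990001 are illustrative assumptions, not the driver's actual code.

```go
package main

import "fmt"

// nodeManager mirrors the two caches named in this issue. The type and
// method names are illustrative, not the driver's actual code.
type nodeManager struct {
	nodeNameToUUID map[string]string // node name -> provider ID (node UUID)
	nodeVMs        map[string]string // provider ID -> VM managed object ref
}

// registerNode mirrors the update path described above: it overwrites both
// maps with the new provider ID but never evicts the old provider ID from
// nodeVMs when a node is re-registered under the same name.
func (m *nodeManager) registerNode(name, uuid, vmRef string) {
	m.nodeNameToUUID[name] = uuid
	m.nodeVMs[uuid] = vmRef
}

func main() {
	m := &nodeManager{
		nodeNameToUUID: map[string]string{},
		nodeVMs:        map[string]string{},
	}

	// Node first registered (names and UUIDs taken from the logs below).
	m.registerNode("f4517b0a375a-qual-364-1501abe9",
		"4211836d-87b7-9c59-4988-3846de984083", "vm-988369")

	// Node recreated with the same name: the CSINode object is updated in
	// place, so no unregister/delete event runs before re-registration.
	// vm-990001 is a hypothetical ref for the replacement VM.
	m.registerNode("f4517b0a375a-qual-364-1501abe9",
		"421151be-d305-3f8e-6f3b-c09251535b91", "vm-990001")

	// ListVolumes walks every VM in nodeVMs, so it still queries the
	// already-deleted vm-988369 and fails with the ServerFaultCode above.
	fmt.Printf("nodeVMs has %d entries: %v\n", len(m.nodeVMs), m.nodeVMs)
}
```

Running the sketch prints two entries in nodeVMs: the live VM plus the stale vm-988369, which is the entry the ListVolumes path later trips over.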

The node was first registered:

2023-08-03T20:53:13.376273735Z {"level":"info","time":"2023-08-03T20:53:13.37621663Z","caller":"node/manager.go:139","msg":"Successfully registered node: \"f4517b0a375a-qual-364-1501abe9\" with nodeUUID \"4211836d-87b7-9c59-4988-3846de984083\"","TraceId":"c6656284-2401-467f-841a-a42e735bf37f"}

The node was registered again with a new provider ID. There were no node-unregister logs in between, so the old machine with provider ID 4211836d-87b7-9c59-4988-3846de984083 was never removed from 'nodeVMs'.

2023-08-03T21:07:58.775821047Z {"level":"info","time":"2023-08-03T21:07:58.775731605Z","caller":"node/manager.go:139","msg":"Successfully registered node: \"f4517b0a375a-qual-364-1501abe9\" with nodeUUID \"421151be-d305-3f8e-6f3b-c09251535b91\"","TraceId":"dd6d8d95-a507-4523-ab8e-56c4d3737527"}

Subsequent ListVolumes requests failed while attempting to fetch information for the old machine, which had already been deleted:

2023-08-03T21:10:14.620668764Z {"level":"error","time":"2023-08-03T21:10:14.620498747Z","caller":"vanilla/controller_helper.go:190","msg":"failed to get VM managed objects from VM objects, err: ServerFaultCode: The object 'vim.VirtualMachine:vm-988369' has already been deleted or has not been completely created","TraceId":"6cc4c5a8-e6c0-446d-b309-f4abb46bc0f7","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.getBlockVolumeIDToNodeUUIDMap\n\t/build/pkg/csi/service/vanilla/controller_helper.go:190\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).processQueryResultsListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2582\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2542\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2556\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}

Environment:

  • csi-vsphere version: v3.0.2
  • vsphere-cloud-controller-manager version:
  • Kubernetes version: v1.27
  • vSphere version:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

jingxu97 commented Aug 8, 2023

/cc @divyenpatel @xing-yang


chethanv28 commented Sep 12, 2023

@jingxu97 @dzjiang91
We recently fixed an issue in the Node Manager component of the vSphere CSI driver where a similar error was observed due to a stale cache.

Here is the PR - #2546.

The PR was merged a couple of days ago on the master branch, so the fix will be available in 3.1.1 or 3.2.0, whichever ships first.
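
For anyone reading along, the general shape of such a fix, extending the hypothetical nodeManager sketch earlier in this thread, is to evict the VM cached under the node's previous provider ID when the node re-registers under the same name. This is a hedged illustration of the approach, not necessarily what #2546 actually does:

```go
// Hedged sketch of one way to avoid the stale entry: when a node
// re-registers under the same name with a different provider ID, drop the
// VM cached under the old ID before recording the new one. Illustrative
// only; not necessarily how PR #2546 implements the fix.
func (m *nodeManager) registerNodeFixed(name, uuid, vmRef string) {
	if oldUUID, ok := m.nodeNameToUUID[name]; ok && oldUUID != uuid {
		delete(m.nodeVMs, oldUUID) // evict the deleted VM's stale entry
	}
	m.nodeNameToUUID[name] = uuid
	m.nodeVMs[uuid] = vmRef
}
```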

chethanv28 self-assigned this Sep 13, 2023
divyenpatel commented

The issue is fixed on the master branch, and the fix will be available in the v3.1.1 patch release.

jingxu97 commented

Thank you @chethanv28 and @divyenpatel!

I am wondering whether the fix will be cherry-picked to the 3.0.x release? Thank you!


chethanv28 commented Oct 18, 2023

> Thank you @chethanv28 and @divyenpatel!
>
> I am wondering whether the fix will be cherry-picked to the 3.0.x release? Thank you!

@jingxu97 Yes, the fix is being cherry-picked to the 3.0.3 release - #2561.
We are in the process of finishing validation for the 3.0.3 patch release and will leave a note here once the patch is out.
