The object 'vim.VirtualMachine:vm-988369' has already been deleted or has not been completely created #2495

Closed
dzjiang91 opened this issue Aug 7, 2023 · 5 comments

dzjiang91 commented Aug 7, 2023

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

We recreated nodes one by one, reusing the same node name (deleting a node and creating a new node with the same name but a different VM), without restarting the vSphere CSI controller. Afterwards, PVC creation failed because listing volumes failed:

2023-08-03T21:08:14.583819594Z {"level":"error","time":"2023-08-03T21:08:14.583619278Z","caller":"vanilla/controller_helper.go:190","msg":"failed to get VM managed objects from VM objects, err: ServerFaultCode: The object 'vim.VirtualMachine:vm-988354' has already been deleted or has not been completely created","TraceId":"aaaa245f-716f-4044-88c3-9505a183f368","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.getBlockVolumeIDToNodeUUIDMap\n\t/build/pkg/csi/service/vanilla/controller_helper.go:190\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).processQueryResultsListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2582\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2542\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2556\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}

What you expected to happen:
PVC requests should not fail after node recreation.

How to reproduce it (as minimally and precisely as possible):
IIUC, this failure is caused by a race condition, and I don't have a reliable way to reproduce it. It may be reproducible by recreating a node with the same name and then attempting PVC creation.

Anything else we need to know?:

From code reading and logs: the vSphere CSI controller watches CSINode objects and maintains two maps, 'nodeNameToUUID' (node name to provider ID) and 'nodeVMs' (provider ID to VM information). When a CSINode is deleted, the controller removes the node name and its provider ID from both maps. During the machine recreation, however, the controller never observed a CSINode removal; it only observed the CSINode being updated with a new provider ID. In that case it updates 'nodeNameToUUID' and 'nodeVMs' with the new provider ID without deleting the old machine's entry from 'nodeVMs'.

During PVC creation, the vSphere controller tried to list the attached volumes of every machine in 'nodeVMs'. Because 'nodeVMs' still contained a machine that had already been removed, the list operation failed, which is exactly what we observed.
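
To make the sequence concrete, here is a minimal Go sketch of the cache behavior described above. The map names come from this issue; the nodeManager type, the registerNode method, and the new VM ref vm-990001 are illustrative assumptions, not the driver's actual code.

```go
package main

import "fmt"

// nodeManager mirrors the two caches named in this issue. The type and
// method names are illustrative, not the driver's actual code.
type nodeManager struct {
	nodeNameToUUID map[string]string // node name -> provider ID (node UUID)
	nodeVMs        map[string]string // provider ID -> VM managed object ref
}

// registerNode mirrors the update path described above: it overwrites both
// maps with the new provider ID but never evicts the old provider ID from
// nodeVMs when a node is re-registered under the same name.
func (m *nodeManager) registerNode(name, uuid, vmRef string) {
	m.nodeNameToUUID[name] = uuid
	m.nodeVMs[uuid] = vmRef
}

func main() {
	m := &nodeManager{
		nodeNameToUUID: map[string]string{},
		nodeVMs:        map[string]string{},
	}

	// Node first registered (names and UUIDs taken from the logs below).
	m.registerNode("f4517b0a375a-qual-364-1501abe9",
		"4211836d-87b7-9c59-4988-3846de984083", "vm-988369")

	// Node recreated with the same name: the CSINode object is updated in
	// place, so no unregister/delete event runs before re-registration.
	// vm-990001 is a hypothetical ref for the replacement VM.
	m.registerNode("f4517b0a375a-qual-364-1501abe9",
		"421151be-d305-3f8e-6f3b-c09251535b91", "vm-990001")

	// ListVolumes walks every VM in nodeVMs, so it still queries the
	// already-deleted vm-988369 and fails with the ServerFaultCode above.
	fmt.Printf("nodeVMs has %d entries: %v\n", len(m.nodeVMs), m.nodeVMs)
}
```

Running the sketch prints two entries in nodeVMs: the live VM plus the stale vm-988369, which is the entry the ListVolumes path later trips over.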

The node was first registered:

2023-08-03T20:53:13.376273735Z {"level":"info","time":"2023-08-03T20:53:13.37621663Z","caller":"node/manager.go:139","msg":"Successfully registered node: \"f4517b0a375a-qual-364-1501abe9\" with nodeUUID \"4211836d-87b7-9c59-4988-3846de984083\"","TraceId":"c6656284-2401-467f-841a-a42e735bf37f"}

The node was registered again with a new provider ID. There were no node-unregister logs in between, so the old machine with provider ID 4211836d-87b7-9c59-4988-3846de984083 was never removed from 'nodeVMs'.

2023-08-03T21:07:58.775821047Z {"level":"info","time":"2023-08-03T21:07:58.775731605Z","caller":"node/manager.go:139","msg":"Successfully registered node: \"f4517b0a375a-qual-364-1501abe9\" with nodeUUID \"421151be-d305-3f8e-6f3b-c09251535b91\"","TraceId":"dd6d8d95-a507-4523-ab8e-56c4d3737527"}

Subsequent ListVolumes requests failed while attempting to fetch information for the old machine, which had already been deleted:

2023-08-03T21:10:14.620668764Z {"level":"error","time":"2023-08-03T21:10:14.620498747Z","caller":"vanilla/controller_helper.go:190","msg":"failed to get VM managed objects from VM objects, err: ServerFaultCode: The object 'vim.VirtualMachine:vm-988369' has already been deleted or has not been completely created","TraceId":"6cc4c5a8-e6c0-446d-b309-f4abb46bc0f7","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.getBlockVolumeIDToNodeUUIDMap\n\t/build/pkg/csi/service/vanilla/controller_helper.go:190\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).processQueryResultsListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2582\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2542\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2556\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}

Environment:

  • csi-vsphere version: v3.0.2
  • vsphere-cloud-controller-manager version:
  • Kubernetes version: v1.27
  • vSphere version:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

jingxu97 commented Aug 8, 2023

/cc @divyenpatel @xing-yang


chethanv28 commented Sep 12, 2023

@jingxu97 @dzjiang91
We recently fixed an issue in the Node Manager component of the vSphere CSI driver where a similar error was observed due to a stale cache.

Here is the PR - #2546.

The PR was merged a couple of days ago on the master branch, so the fix will be available in 3.1.1 or 3.2.0, whichever ships first.
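
For anyone reading along, the general shape of such a fix, extending the hypothetical nodeManager sketch earlier in this thread, is to evict the VM cached under the node's previous provider ID when the node re-registers under the same name. This is a hedged illustration of the approach, not necessarily what #2546 actually does:

```go
// Hedged sketch of one way to avoid the stale entry: when a node
// re-registers under the same name with a different provider ID, drop the
// VM cached under the old ID before recording the new one. Illustrative
// only; not necessarily how PR #2546 implements the fix.
func (m *nodeManager) registerNodeFixed(name, uuid, vmRef string) {
	if oldUUID, ok := m.nodeNameToUUID[name]; ok && oldUUID != uuid {
		delete(m.nodeVMs, oldUUID) // evict the deleted VM's stale entry
	}
	m.nodeNameToUUID[name] = uuid
	m.nodeVMs[uuid] = vmRef
}
```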

chethanv28 self-assigned this Sep 13, 2023
divyenpatel commented

The issue is fixed on the master branch, and the fix will be available in the v3.1.1 patch release.

jingxu97 commented

Thank you @chethanv28 and @divyenpatel!

I am wondering whether the fix will be cherry-picked to the 3.0.x release? Thank you!


chethanv28 commented Oct 18, 2023

> Thank you @chethanv28 and @divyenpatel!
>
> I am wondering whether the fix will be cherry-picked to the 3.0.x release? Thank you!

@jingxu97 Yes, the fix is being cherry-picked to the 3.0.3 release - #2561.
We are in the process of finishing validation for the 3.0.3 patch release and will leave a note here once the patch is out.
