Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent calling register volume in parallel for same vmdk #1398

Conversation

divyenpatel
Copy link
Member

@divyenpatel divyenpatel commented Nov 3, 2021

What this PR does / why we need it:

We have a bug in the FCD which might generate different FCD IDs for the same vmdk. If this happens, migrated volume is not guaranteed to be used with the cached Volume ID in the VolumeMigrationService.

This PR prevents calling Register Volume in parallel for the same vmdk.

This PR also prevents calling RegisterVolume from the CSI Controller Pod. Only the syncer container will be able to perform volume registration after this PR is merged. If for any reason, volume registration fails, the syncer container can attempt to re-register the volume during a full sync cycle. If vSphere CSI controller gets any call to operate on the migrated volume before it is registered, it will return an error, and CSI driver will retry the operation in the next attempt.

Testing done:
Verified enabling Migration when syncer is down, so migrated PV did not register before creating the Pod.
During Pod Creation, Controller Publish failed with an expected error.

Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               60s                default-scheduler        Successfully assigned default/pvpod to k8s-node-470-1636402267
  Warning  FailedAttachVolume      19s (x6 over 37s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-ab2abaf9-6704-436e-b958-05411e9a86af" : rpc error: code = Internal desc = failed to get VolumeID from volumeMigrationService for volumePath: "[vsanDatastore] 83b38961-4a3f-c694-0552-0200abf20dd7/kubernetes-dynamic-pvc-ab2abaf9-6704-436e-b958-05411e9a86af.vmdk"
  Normal   SuccessfulAttachVolume  2s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ab2abaf9-6704-436e-b958-05411e9a86af"

and later when the syncer container has registered vmdk as FCD, attach operation succeeded.

cc: @bandyopadhyays-vmware @chethanv28

Release note:

prevent calling register volume in parallel for same vmdk

@divyenpatel divyenpatel added this to the vSphere CSI Migration milestone Nov 3, 2021
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 3, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 3, 2021
@divyenpatel
Copy link
Member Author

@chethanv28 I have updated "testing done" section on the PR. Can you review this PR?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chethanv28, divyenpatel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chethanv28,divyenpatel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@divyenpatel
Copy link
Member Author

@chethanv28 can we merge this PR?

@svcbot-qecnsdp
Copy link

Started vanilla Block pipeline... Build Number: 104

@svcbot-qecnsdp
Copy link

Block vanilla build status: FAILURE 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 

Ran 13 of 251 Specs in 6378.089 seconds
FAIL! -- 0 Passed | 13 Failed | 0 Pending | 238 Skipped
--- FAIL: TestE2E (6378.20s)
FAIL

Ginkgo ran 1 suite in 1h53m15.08720705s
Test Suite Failed

@divyenpatel
Copy link
Member Author

Block vanilla build status: FAILURE 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 

Ran 13 of 251 Specs in 6378.089 seconds
FAIL! -- 0 Passed | 13 Failed | 0 Pending | 238 Skipped
--- FAIL: TestE2E (6378.20s)
FAIL

Ginkgo ran 1 suite in 1h53m15.08720705s
Test Suite Failed

These failures are not related to the change. On the setup csi-migration was disabled.
All failures are related to parsing storage class parameters failed with error: invalid param: “csimigration” and value: “true”.

@chethanv28 we can merge this PR safely.

@chethanv28
Copy link
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 11, 2021
@k8s-ci-robot k8s-ci-robot merged commit edfc277 into kubernetes-sigs:master Nov 11, 2021
divyenpatel added a commit to divyenpatel/vsphere-csi-driver that referenced this pull request Dec 13, 2021
divyenpatel added a commit to divyenpatel/vsphere-csi-driver that referenced this pull request Jan 6, 2022
divyenpatel added a commit to divyenpatel/vsphere-csi-driver that referenced this pull request Jan 6, 2022
k8s-ci-robot pushed a commit that referenced this pull request Jan 6, 2022
* prevent calling register volume in parallel for same vmdk (#1398)

* Fix parsing of volume path (#1451)

Remove datastore name from VolumePath using TrimPrefix. Trim will remove all
matching characters from the end of the VolumePath, e.g. it can remove
"vmdk" if the datastore name contains charactes "v", "m", "d" and "k"
(anywhere in the datastore name).

* v2.2.3-rc.1 yaml files

* linter fixes

Co-authored-by: Jan Šafránek <jsafrane@redhat.com>
k8s-ci-robot pushed a commit that referenced this pull request Jan 6, 2022
* prevent calling register volume in parallel for same vmdk (#1398)

* Fix parsing of volume path (#1451)

Remove datastore name from VolumePath using TrimPrefix. Trim will remove all
matching characters from the end of the VolumePath, e.g. it can remove
"vmdk" if the datastore name contains charactes "v", "m", "d" and "k"
(anywhere in the datastore name).

* updating driver tags for v2.3.1-rc.1 release

* linter fixes

Co-authored-by: Jan Šafránek <jsafrane@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-2.4.1-candidate size/L Denotes a PR that changes 100-499 lines, ignoring generated files. v2.4.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants