Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set volume health to inaccessible when PVC not found in CNS #944

Merged
merged 1 commit into from
Jun 8, 2021

Conversation

xing-yang
Copy link
Contributor

@xing-yang xing-yang commented Jun 2, 2021

What this PR does / why we need it:
Set volume health to inaccessible when PVC not found in CNS

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Testing done:
https://container-dp.svc.eng.vmware.com/view/Pre-Checkin-CSI/job/csi-wcp-pre-check-in/29/testReport/(root)/CNS%20CSI%20Driver%20End-to-End%20Tests/

13 tests passed, 1 failed.

CNS CSI Driver End-to-End Tests.Basic Static Provisioning [csi-supervisor] Verify static provisioning workflow - when DuplicateFCD is used

/home/worker/workspace/csi-wcp-pre-check-in@2/Results/29/vsphere-csi-driver/tests/e2e/csi_static_provisioning_basic.go:882
Unexpected error:
<*errors.errorString | 0xc000eaed10>: {
s: "PersistentVolume static-pv-561ef975-ada8-4c07-a801-5a5a5c78f4ad still exists within 3m0s",
}
PersistentVolume static-pv-561ef975-ada8-4c07-a801-5a5a5c78f4ad still exists within 3m0s
occurred
/home/worker/workspace/csi-wcp-pre-check-in@2/Results/29/vsphere-csi-driver/tests/e2e/csi_static_provisioning_basic.go:959

The test failure is not caused by the change.

Special notes for your reviewer:

Release note:

Set volume health to inaccessible when PVC is not found in CNS

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2021
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 2, 2021
@xing-yang
Copy link
Contributor Author

jtest wcp

@xing-yang
Copy link
Contributor Author

jtest gc

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Started GC block pipeline...

@svcbot-qecnsdp
Copy link

WCP build status: FAILURE 
Stage before exit: null 

@xing-yang
Copy link
Contributor Author

jtest wcp

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

WCP build status: FAILURE 
Stage before exit: null 

@svcbot-qecnsdp
Copy link

GC build status: FAILURE 
Stage before exit: deploy-gc-testbed 

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 18
WCP build status: FAILURE 
Stage before exit: null 
Jenkins E2E Test Results: 
Ran 14 of 187 Specs in 82.929 seconds
FAIL! -- 0 Passed | 14 Failed | 0 Pending | 173 Skipped
--- FAIL: TestE2E (83.06s)
FAIL

Ginkgo ran 1 suite in 1m51.461080604s
Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/18/vsphere-csi-driver`

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 19
WCP build status: FAILURE 
Stage before exit: null 

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 20
WCP build status: FAILURE 
Stage before exit: null 
Jenkins E2E Test Results: 
Ran 14 of 187 Specs in 4483.011 seconds
FAIL! -- 0 Passed | 14 Failed | 0 Pending | 173 Skipped
--- FAIL: TestE2E (4483.14s)
FAIL

Ginkgo ran 1 suite in 1h15m7.870977408s
Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/20/vsphere-csi-driver`

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 21
WCP build status: FAILURE 
Stage before exit: null 

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 22
WCP build status: FAILURE 
Stage before exit: null 

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 26
WCP build status: FAILURE 
Stage before exit: null 
Jenkins E2E Test Results: 
Ran 14 of 187 Specs in 4673.084 seconds
FAIL! -- 1 Passed | 13 Failed | 0 Pending | 173 Skipped
--- FAIL: TestE2E (4673.22s)
FAIL

Ginkgo ran 1 suite in 1h18m22.430803379s
Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/26/vsphere-csi-driver`

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 27
WCP build status: FAILURE 
Stage before exit: null 
Jenkins E2E Test Results: 
Ran 14 of 187 Specs in 106.904 seconds
FAIL! -- 0 Passed | 14 Failed | 0 Pending | 173 Skipped
--- FAIL: TestE2E (107.02s)
FAIL

Ginkgo ran 1 suite in 2m14.416897446s
Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/27/vsphere-csi-driver`

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 30
WCP build status: SUCCESS 
Stage before exit: finally 
Jenkins E2E Test Results: 

Ran 14 of 187 Specs in 3253.154 seconds
SUCCESS! -- 14 Passed | 0 Failed | 0 Pending | 173 Skipped
PASS

Ginkgo ran 1 suite in 54m43.233742796s
Test Suite Passed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/30/vsphere-csi-driver`

pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
@lipingxue
Copy link
Contributor

@xing-yang Could you run some test to make sure when PVC not found in CNS, the volume health status is set to inaccessible?

@xing-yang
Copy link
Contributor Author

@xing-yang Could you run some test to make sure when PVC not found in CNS, the volume health status is set to inaccessible?

This is a rare corner case. I don't know how to reproduce this case. Any suggestions?

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 31
WCP build status: SUCCESS 
Stage before exit: finally 
Jenkins E2E Test Results: 

Ran 1 of 187 Specs in 358.073 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 186 Skipped
PASS

Ginkgo ran 1 suite in 6m26.377712229s
Test Suite Passed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/31/vsphere-csi-driver`

@lipingxue
Copy link
Contributor

lipingxue commented Jun 4, 2021

@xing-yang Could you run some test to make sure when PVC not found in CNS, the volume health status is set to inaccessible?

This is a rare corner case. I don't know how to reproduce this case. Any suggestions?

In the code comment, I saw "When a Datastore is removed from VC (like vSAN direct disk decommisson with noAction does)" then PVC will not be found in CNS. So is there any bug that this PR try to fix?
If it is, we may ask the QE who filed this bug to verify with this fix.

Maybe we can try to delete the FCD disk out of band to simulate the case that PVC not found in CNS?

pkg/syncer/volume_health.go Show resolved Hide resolved
pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
@subramanian-neelakantan
Copy link
Contributor

@xing-yang Could you run some test to make sure when PVC not found in CNS, the volume health status is set to inaccessible?

This is a rare corner case. I don't know how to reproduce this case. Any suggestions?

In the code comment, I saw "When a Datastore is removed from VC (like vSAN direct disk decommisson with noAction does)" then PVC will not be found in CNS. So is there any bug that this PR try to fix?
If it is, we may ask the QE who filed this bug to verify with this fix.

Maybe we can try to delete the FCD disk out of band to simulate the case that PVC not found in CNS?

We have two cases to verify manually. Assuming that you can change the frequency of volume health poller from the default 5 mins, here are the two cases:
Case 1. Set volume health syncer interval > 15 mins so that CNS cache DB sync happens before volume health syncer.
Case 2. Set volume health syncer interval < 15 mins so that volume health syncer runs before CNS cache DB is updated.

Run tests for both cases as follows.

  • Create a PVC. Wait for its health to show as "accessible".
  • Delete the underlying FCD out-of-band. This can be done from the https://<vc_ip/vslm/mob using the VStorageObjectManager#VslmDeleteVStorageObject_Task() API.
  • Expectations in each case:
    • Case 1 (health syncer > 15 mins):
      * (Without fix) PVC remains "accessible" forever
      * (With fix) PVC goes to "inaccessible" when health syncer runs
    • Case 2 (health syncer < 15 mins):
      * (Without fix) PVC goes to "" health state and then remains that way forever
      * (With fix) PVC goes to "" health state and after 15 mins goes to "inaccessible"

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 32
WCP build status: FAILURE 
Stage before exit: null 
Jenkins E2E Test Results: 
Ran 14 of 187 Specs in 732.107 seconds
FAIL! -- 0 Passed | 14 Failed | 0 Pending | 173 Skipped
--- FAIL: TestE2E (732.24s)
FAIL

Ginkgo ran 1 suite in 12m38.978136957s
Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/32/vsphere-csi-driver`

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 33
WCP build status: FAILURE 
Stage before exit: null 
Jenkins E2E Test Results: 
Ran 14 of 187 Specs in 126.179 seconds
FAIL! -- 0 Passed | 14 Failed | 0 Pending | 173 Skipped
--- FAIL: TestE2E (126.30s)
FAIL

Ginkgo ran 1 suite in 2m35.553567031s
Test Suite Failed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/33/vsphere-csi-driver`

@xing-yang
Copy link
Contributor Author

xing-yang commented Jun 7, 2021

Manually ran the following test:

Default volume heath sync is 2 min and full sync is also 2 min

Environment:
  CLUSTER_FLAVOR:                  WORKLOAD
  FULL_SYNC_INTERVAL_MINUTES:      2
  VOLUME_HEALTH_INTERVAL_MINUTES:  2
  POD_POLL_INTERVAL_SECONDS:       2

Created a new volume, no health status yet

kubectl describe pvc -n e2e-test-namespace

Name: block-pvc
Namespace: e2e-test-namespace
StorageClass: shared-ds-policy
Status: Bound
Volume: pvc-af778591-5423-43ea-8731-99142edb54b9
Labels:
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By:
Events:
Type Reason Age From Message


Normal ExternalProvisioning 25s persistentvolume-controller waiting for a volume to be created, either by external provisioner "csi.vsphere.vmware.com" or manually created by system administrator
Normal Provisioning 25s csi.vsphere.vmware.com_4220a295c5e4dcd867ee94daa3038e09_415ba9cd-0cc9-4eee-ad16-e52419fd8251 External provisioner is provisioning volume for claim "e2e-test-namespace/block-pvc"
Normal ProvisioningSucceeded 21s csi.vsphere.vmware.com_4220a295c5e4dcd867ee94daa3038e09_415ba9cd-0cc9-4eee-ad16-e52419fd8251 Successfully provisioned volume pvc-af778591-5423-43ea-8731-99142edb54b9

Health status shows up now

kubectl describe pvc -n e2e-test-namespace

Name: block-pvc
Namespace: e2e-test-namespace
StorageClass: shared-ds-policy
Status: Bound
Volume: pvc-af778591-5423-43ea-8731-99142edb54b9
Labels:
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
volumehealth.storage.kubernetes.io/health: accessible
volumehealth.storage.kubernetes.io/health-timestamp: Mon Jun 7 19:38:08 UTC 2021
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By:

Corresponding logs:

2021-06-07T19:38:08.528Z DEBUG syncer/volume_health.go:122 updateVolumeHealthStatus: update volume health annotation for pvc e2e-test-namespace/block-pvc from old value to new value accessible {"TraceId": "3f7dc2ef-843a-476a-b1cb-0ec6515b55fc"}
2021-06-07T19:38:08.528Z INFO syncer/volume_health.go:127 updateVolumeHealthStatus: set annotation for health to accessible at time Mon Jun 7 19:38:08 UTC 2021 for pvc e2e-test-namespace/block-pvc {"TraceId": "3f7dc2ef-843a-476a-b1cb-0ec6515b55fc"}

Delete FCD at 19:44:23 UTC 2021; Health status stayed as accessible until it is updated later

kubectl describe pvc -n e2e-test-namespace

Name: block-pvc
Namespace: e2e-test-namespace
StorageClass: shared-ds-policy
Status: Bound
Volume: pvc-af778591-5423-43ea-8731-99142edb54b9
Labels:
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
volumehealth.storage.kubernetes.io/health: accessible
volumehealth.storage.kubernetes.io/health-timestamp: Mon Jun 7 19:38:08 UTC 2021
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By:

Changed to “inaccessible” 4 minutes after FCD is deleted

It coincided with the ending of a FullSync period.

kubectl describe pvc -n e2e-test-namespace

Name: block-pvc
Namespace: e2e-test-namespace
StorageClass: shared-ds-policy
Status: Bound
Volume: pvc-af778591-5423-43ea-8731-99142edb54b9
Labels:
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
volumehealth.storage.kubernetes.io/health: inaccessible
volumehealth.storage.kubernetes.io/health-timestamp: Mon Jun 7 19:48:08 UTC 2021
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By:

Corresponding logs:

2021-06-07T19:48:07.994Z WARN syncer/fullsync.go:367 could not find any volume which is present in both k8s and in CNS {"TraceId": "0d895233-57fa-4639-9db9-d4201a3a8572"}

2021-06-07T19:48:07.998Z INFO syncer/fullsync.go:430 FullSync: Volume with id: "bd9536ef-6696-4383-a5ba-aefa2a56dd84" and name: "pvc-af778591-5423-43ea-8731-99142edb54b9" is added to cnsCreationMap {"TraceId": "0d895233-57fa-4639-9db9-d4201a3a8572"}

2021-06-07T19:48:08.000Z INFO syncer/fullsync.go:137 FullSync: end {"TraceId": "0d895233-57fa-4639-9db9-d4201a3a8572"}

2021-06-07T19:48:08.006Z DEBUG syncer/volume_health.go:122 updateVolumeHealthStatus: update volume health annotation for pvc e2e-test-namespace/block-pvc from old value accessible to new value inaccessible {"TraceId": "0d895233-57fa-4639-9db9-d4201a3a8572"}
2021-06-07T19:48:08.007Z INFO syncer/volume_health.go:127 updateVolumeHealthStatus: set annotation for health to inaccessible at time Mon Jun 7 19:48:08 UTC 2021 for pvc e2e-test-namespace/block-pvc {"TraceId": "0d895233-57fa-4639-9db9-d4201a3a8572"}

@xing-yang
Copy link
Contributor Author

@subramanian-neelakantan @SandeepPissay @lipingxue Comments are addressed. PTAL.

@svcbot-qecnsdp
Copy link

Started WCP block pipeline...

@svcbot-qecnsdp
Copy link

Build ID: 35
WCP build status: SUCCESS 
Stage before exit: finally 
Jenkins E2E Test Results: 

Ran 14 of 187 Specs in 3233.858 seconds
SUCCESS! -- 14 Passed | 0 Failed | 0 Pending | 173 Skipped
PASS

Ginkgo ran 1 suite in 54m25.137123102s
Test Suite Passed
make: Leaving directory `/home/worker/workspace/csi-wcp-pre-check-in/Results/35/vsphere-csi-driver`

@subramanian-neelakantan
Copy link
Contributor

Looks good to me.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: subramanian-neelakantan, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

@SandeepPissay SandeepPissay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes looks good to me. I have few comments on the logging.

pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
pkg/syncer/volume_health.go Outdated Show resolved Hide resolved
@xing-yang
Copy link
Contributor Author

@SandeepPissay comments are addressed. PTAL.

@SandeepPissay
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 8, 2021
@k8s-ci-robot k8s-ci-robot merged commit fe6484f into kubernetes-sigs:master Jun 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants