[IMPROVEMENT] Blindly stop raid bdev exposure before exposing it for V2 volume #7324

Closed
derekbit opened this issue Dec 13, 2023 · 4 comments
Labels
area/v2-data-engine v2 data engine (SPDK) kind/improvement Request for improvement of existing function priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/backport Require backport. Only used when the specific versions to backport have not been definied. require/doc Require updating the longhorn.io documentation

Comments

@derekbit
Member

Is your improvement request related to a feature? Please describe (👍 if you like this request)

In some failure cases, the SPDK engine raises the error nvmf ... subsystem already exists while attempting to create a new subsystem for a RAID bdev. This happens because orphaned resources are left undeleted after an earlier error.
To prevent this error, we can blindly stop the RAID bdev exposure before exposing it, which cleans up any orphaned resources.
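
The cleanup-before-expose idea can be sketched as follows. This is only an illustration under stated assumptions, not the Longhorn or SPDK implementation: `FakeNvmfTarget`, `stop_expose`, `expose`, and the NQN and bdev names are all hypothetical, and the in-memory dictionary stands in for the real NVMe-oF target state.

```python
class SubsystemExistsError(Exception):
    """Raised by the fake target when a subsystem NQN is already registered."""


class FakeNvmfTarget:
    """Hypothetical in-memory stand-in for an NVMe-oF target (not the SPDK API)."""

    def __init__(self):
        self.subsystems = {}  # NQN -> name of the exposed bdev

    def create_subsystem(self, nqn, bdev_name):
        if nqn in self.subsystems:
            raise SubsystemExistsError(f"subsystem {nqn} already exists")
        self.subsystems[nqn] = bdev_name

    def delete_subsystem(self, nqn):
        # Raises KeyError when the subsystem is absent.
        del self.subsystems[nqn]


def stop_expose(target, nqn):
    """Remove the subsystem if present; a missing subsystem is not an error."""
    try:
        target.delete_subsystem(nqn)
    except KeyError:
        pass  # nothing to clean up


def expose(target, nqn, bdev_name):
    """Blindly clean up any orphaned subsystem, then expose the bdev."""
    stop_expose(target, nqn)
    target.create_subsystem(nqn, bdev_name)


target = FakeNvmfTarget()
# Simulate an orphan left behind by an earlier failed attempt.
target.create_subsystem("nqn.example:vol-1", "raid-vol-1")
# Without the cleanup step this call would raise SubsystemExistsError.
expose(target, "nqn.example:vol-1", "raid-vol-1")
```

The key property is that the stop step tolerates a missing subsystem, so it can run unconditionally. In this sketch it would also tear down a healthy exposure with the same name, so the pattern assumes the caller owns that name.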

Describe the solution you'd like

Describe alternatives you've considered

Additional context

@derekbit derekbit added require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation kind/improvement Request for improvement of existing function require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Dec 13, 2023
@derekbit derekbit changed the title [IMPROVEMENT] Blindly stop raid bdev exposure before exposing it [IMPROVEMENT] Blindly stop raid bdev exposure before exposing it for V2 volume Dec 13, 2023
@derekbit derekbit added the area/v2-data-engine v2 data engine (SPDK) label Dec 13, 2023
@derekbit derekbit self-assigned this Dec 13, 2023
@innobead innobead added this to the v1.6.0 milestone Dec 13, 2023
@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 13, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

This is for resilience.
Check that v2 volume attachment and detachment work as expected.

@derekbit derekbit added the priority/0 Must be fixed in this release (managed by PO) label Dec 13, 2023
@yangchiu
Member

Tested on master-head (longhorn-manager a42748a), following https://longhorn.io/docs/1.5.3/spdk/quick-start/ to set up a v2 environment and create a v2 volume. At the steps Create a StorageClass and Create Longhorn Volumes, no v2 volume can be created:

$ kubectl describe pod volume-test 
Name:         volume-test
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  volume-test:
    Image:        nginx:stable-alpine
    Port:         80/TCP
    Host Port:    0/TCP
    Liveness:     exec [ls /data/lost+found] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /data from volv (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g6rvb (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  volv:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  longhorn-volv-pvc
    ReadOnly:   false
  kube-api-access-g6rvb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  6m    default-scheduler  0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
  Warning  FailedScheduling  43s   default-scheduler  0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
$ kubectl describe pvc longhorn-volv-pvc
Name:          longhorn-volv-pvc
Namespace:     default
StorageClass:  longhorn-v2-data-engine
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
               volume.kubernetes.io/storage-provisioner: driver.longhorn.io
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       volume-test
Events:
  Type     Reason                Age                   From                                                                                      Message
  ----     ------                ----                  ----                                                                                      -------
  Warning  ProvisioningFailed    2m22s                 driver.longhorn.io_csi-provisioner-7b95bf4b87-wnbtg_f759a4a2-f551-4bb0-8891-751410564d54  failed to provision volume with StorageClass "longhorn-v2-data-engine": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [detail=, message=failed to create volume: unable to create volume pvc-0bc4eed8-f5fc-45a6-80ad-d10c543edd7c: admission webhook "validator.longhorn.io" denied the request: cannot find engine image by longhornio/longhorn-instance-manager:master-head, code=Internal Server Error] from [http://longhorn-backend:9500/v1/volumes]
  Warning  ProvisioningFailed    82s (x3 over 2m25s)   driver.longhorn.io_csi-provisioner-7b95bf4b87-wnbtg_f759a4a2-f551-4bb0-8891-751410564d54  failed to provision volume with StorageClass "longhorn-v2-data-engine": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=failed to create volume: unable to create volume pvc-0bc4eed8-f5fc-45a6-80ad-d10c543edd7c: admission webhook "validator.longhorn.io" denied the request: cannot find engine image by longhornio/longhorn-instance-manager:master-head, code=Internal Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]
  Normal   Provisioning          18s (x8 over 2m25s)   driver.longhorn.io_csi-provisioner-7b95bf4b87-wnbtg_f759a4a2-f551-4bb0-8891-751410564d54  External provisioner is provisioning volume for claim "default/longhorn-volv-pvc"
  Warning  ProvisioningFailed    18s (x4 over 2m24s)   driver.longhorn.io_csi-provisioner-7b95bf4b87-wnbtg_f759a4a2-f551-4bb0-8891-751410564d54  failed to provision volume with StorageClass "longhorn-v2-data-engine": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Internal Server Error, detail=, message=failed to create volume: unable to create volume pvc-0bc4eed8-f5fc-45a6-80ad-d10c543edd7c: admission webhook "validator.longhorn.io" denied the request: cannot find engine image by longhornio/longhorn-instance-manager:master-head] from [http://longhorn-backend:9500/v1/volumes]
  Normal   ExternalProvisioning  14s (x11 over 2m25s)  persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator

supportbundle_2fef90e8-dede-4414-80df-df294eb6e174_2023-12-27T01-35-31Z.zip

cc @derekbit

@derekbit
Member Author

Tested on master-head (longhorn-manager a42748a), following https://longhorn.io/docs/1.5.3/spdk/quick-start/ to set up a v2 environment and create a v2 volume. At the steps Create a StorageClass and Create Longhorn Volumes, no v2 volume can be created:

This is caused by the #5842 feature. It is fixed in longhorn/longhorn-manager@14a841d.

@yangchiu
Member

Verified as passed on master-head (longhorn-manager be08850). v2 volume attachment and detachment work as expected.
