
[BUG] The second time PVC online expansion always fail on SLES #6076

Closed

chriscchien opened this issue Jun 7, 2023 · 9 comments
Assignees
Labels
area/upstream Upstream related like tgt upstream library area/volume-expansion Volume expansion related component/longhorn-manager Longhorn manager (control plane) investigation-needed Need to identify the case before estimating and starting the development kind/bug reproduce/always 100% reproducible severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) wontfix
Milestone

Comments

@chriscchien
Contributor

chriscchien commented Jun 7, 2023

Describe the bug (🐛 if you encounter this issue)

Found via test_csi_mount_volume_online_expansion and reproducible manually on SLES (verified provider: suse-sles-15-sp4-v20230428-hvm-ssd-x86_64). On Ubuntu, both the e2e and manual tests pass.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Longhorn master on SLES.
  2. Dynamically provision a workload with a PVC and wait for the volume to become healthy.
  3. Update pvc.spec.resources to expand the volume (see the example commands after this list).
  4. Observe that status.capacity.storage is updated.
  5. Update pvc.spec.resources to expand the volume again.
  6. status.capacity.storage is not updated even after 10 minutes.
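
For reference, a minimal way to drive steps 3 and 5 from the command line, assuming a PVC named vol1 as in the csi-resizer log below (the requested sizes are only illustrative):

kubectl patch pvc vol1 --type merge -p '{"spec":{"resources":{"requests":{"storage":"3Gi"}}}}'   # first expansion
kubectl patch pvc vol1 --type merge -p '{"spec":{"resources":{"requests":{"storage":"4Gi"}}}}'   # second expansion
kubectl get pvc vol1 -o jsonpath='{.status.capacity.storage}'                                    # check whether the capacity was updated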

Expected behavior

status.capacity.storage should be updated after the second online volume expansion.

Log or Support bundle

After the second expansion, an error related to syncing the Longhorn setting longhorn-system/storage-network can be seen:

2023-06-07T04:03:03.034101623Z time="2023-06-07T04:03:03Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=ip-172-31-95-101
2023-06-07T04:03:03.644448416Z time="2023-06-07T04:03:03Z" level=info msg="Updating log level from debug to debug"
2023-06-07T04:03:03.644573025Z time="2023-06-07T04:03:03Z" level=error msg="Failed to sync Longhorn setting longhorn-system/spdk" controller=longhorn-setting error="failed to sync setting for longhorn-system/spdk: cannot apply spdk setting to Longhorn workloads when there are attached volumes" node=ip-172-31-95-101

supportbundle_8ae26f52-def6-4d27-b88a-33b1df664bca_2023-06-07T04-06-03Z.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.2+k3s1
  • Node config
    • OS type and version: suse-sles-15-sp4-v20230428-hvm-ssd-x86_64
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS

Additional context

#5839

@chriscchien chriscchien added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) reproduce/always 100% reproducible labels Jun 7, 2023
@innobead innobead added this to the v1.5.0 milestone Jun 7, 2023
@innobead innobead added component/longhorn-manager Longhorn manager (control plane) area/volume-expansion Volume expansion related labels Jun 7, 2023
@innobead
Member

innobead commented Jun 7, 2023

@c3y1huang Please help check to see if this is a real issue.

@chriscchien the log you provided is irrelevant. Did you see any other suspicious logs?

@innobead
Member

innobead commented Jun 7, 2023

cc @shuo-wu

@chriscchien
Contributor Author

In the csi-resizer pod, the following can be seen:

UID:"e2aa6c31-413d-4e32-8e9b-d5130d832b1e", APIVersion:"v1", ResourceVersion:"4626", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-e2aa6c31-413d-4e32-8e9b-d5130d832b1e
2023-06-07T03:22:56.921826393Z I0607 03:22:56.921677       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"csi-mount-volume-online-expansion-test-pvc", UID:"e2aa6c31-413d-4e32-8e9b-d5130d832b1e", APIVersion:"v1", ResourceVersion:"4626", FieldPath:""}): type: 'Normal' reason: 'FileSystemResizeRequired' Require file system resize of volume on node
2023-06-07T03:37:32.333300514Z I0607 03:37:32.333150       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"60597c25-dd21-4b86-acf8-78d208f69482", APIVersion:"v1", ResourceVersion:"6343", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-60597c25-dd21-4b86-acf8-78d208f69482
2023-06-07T03:37:42.466554309Z I0607 03:37:42.466387       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"60597c25-dd21-4b86-acf8-78d208f69482", APIVersion:"v1", ResourceVersion:"6343", FieldPath:""}): type: 'Normal' reason: 'FileSystemResizeRequired' Require file system resize of volume on node
2023-06-07T03:38:57.501929339Z I0607 03:38:57.501809       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"60597c25-dd21-4b86-acf8-78d208f69482", APIVersion:"v1", ResourceVersion:"6500", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-60597c25-dd21-4b86-acf8-78d208f69482
2023-06-07T03:39:07.636414583Z I0607 03:39:07.636264       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"60597c25-dd21-4b86-acf8-78d208f69482", APIVersion:"v1", ResourceVersion:"6500", FieldPath:""}): type: 'Normal' reason: 'FileSystemResizeRequired' Require file system resize of volume on node
2023-06-07T03:50:05.858423450Z I0607 03:50:05.858311       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"38ac1989-3c3d-45e8-a6eb-3fe52aca9ec8", APIVersion:"v1", ResourceVersion:"7909", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-38ac1989-3c3d-45e8-a6eb-3fe52aca9ec8
2023-06-07T03:50:15.991606120Z I0607 03:50:15.991463       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"38ac1989-3c3d-45e8-a6eb-3fe52aca9ec8", APIVersion:"v1", ResourceVersion:"7909", FieldPath:""}): type: 'Normal' reason: 'FileSystemResizeRequired' Require file system resize of volume on node
2023-06-07T03:51:25.831998726Z I0607 03:51:25.831832       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"38ac1989-3c3d-45e8-a6eb-3fe52aca9ec8", APIVersion:"v1", ResourceVersion:"8056", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-38ac1989-3c3d-45e8-a6eb-3fe52aca9ec8
2023-06-07T03:51:35.965891662Z I0607 03:51:35.965357       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"vol1", UID:"38ac1989-3c3d-45e8-a6eb-3fe52aca9ec8", APIVersion:"v1", ResourceVersion:"8056", FieldPath:""}): type: 'Normal' reason: 'FileSystemResizeRequired' Require file system resize of volume on node
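
The same Resizing / FileSystemResizeRequired events can also be followed on the PVC itself instead of tailing the csi-resizer log (assuming the PVC name vol1 from the events above):

kubectl describe pvc vol1 -n default                                      # Events section lists Resizing / FileSystemResizeRequired
kubectl get events -n default --field-selector involvedObject.name=vol1   # same events, as raw Event objects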

longhorn-csi-plugin log

2023-06-07T03:40:04.334909504Z time="2023-06-07T03:40:04Z" level=error msg="NodeExpandVolume: err: rpc error: code = Internal desc = failed to read size of filesystem on /dev/longhorn/pvc-60597c25-dd21-4b86-acf8-78d208f69482: exit status 152: dumpe2fs 1.46.4 (18-Aug-2021)\ndumpe2fs: Superblock checksum does not match superblock while trying to open /dev/longhorn/pvc-60597c25-dd21-4b86-acf8-78d208f69482\nFilesystem volume name:   <none>\nLast mounted on:          <not available>\nFilesystem UUID:          24af8e34-b72d-4c34-b051-0f505b7b5cd0\nFilesystem magic number:  0xEF53\nFilesystem revision #:    1 (dynamic)\nFilesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum\nFilesystem flags:         signed_directory_hash \nDefault mount options:    user_xattr acl\nFilesystem state:         clean\nErrors behavior:          Continue\nFilesystem OS type:       Linux\nInode count:              131072\nBlock count:              524288\nReserved block count:     0\nOverhead clusters:        17190\nFree blocks:              507092\nFree inodes:              131061\nFirst block:              0\nBlock size:               4096\nFragment size:            4096\nGroup descriptor size:    64\nReserved GDT blocks:      127\nBlocks per group:         32768\nFragments per group:      32768\nInodes per group:         8192\nInode blocks per group:   512\nFlex block group size:    16\nFilesystem created:       Wed Jun  7 03:37:22 2023\nLast mount time:          Wed Jun  7 03:37:23 2023\nLast write time:          Wed Jun  7 03:37:23 2023\nMount count:              1\nMaximum mount count:      -1\nLast checked:             Wed Jun  7 03:37:22 2023\nCheck interval:           0 (<none>)\nLifetime writes:          533 kB\nReserved blocks uid:      0 (user root)\nReserved blocks gid:      0 (group root)\nFirst inode:              11\nInode size:\t          256\nRequired extra isize:     32\nDesired extra isize:      32\nJournal inode:            8\nDefault directory hash:   half_md4\nDirectory Hash Seed:      32ef7d78-6233-41c4-999c-4cbdf2c817bd\nJournal backup:           inode blocks\nChecksum type:            crc32c\nChecksum:                 0xff179995\nJournal features:         journal_64bit journal_checksum_v3\nTotal journal size:       32M\nTotal journal blocks:     8192\nMax transaction length:   8192\nFast commit length:       0\nJournal sequence:         0x00000004\nJournal start:            149\nJournal checksum type:    crc32c\nJournal checksum:         0xdb52b80b\n\n*** Run e2fsck now!\n\n"
2023-06-07T03:40:04.926416360Z time="2023-06-07T03:40:04Z" level=info msg="NodeExpandVolume: req: {\"capacity_range\":{\"required_bytes\":3221225472},\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/d8621cef89dbaeacb03c1d54f9b04c71f4052bef0e71afb81c97c9c2567b6180/globalmount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":1}},\"volume_id\":\"pvc-60597c25-dd21-4b86-acf8-78d208f69482\",\"volume_path\":\"/var/lib/kubelet/pods/4fffa23c-09fc-458d-ba1f-987e518860ad/volumes/kubernetes.io~csi/pvc-60597c25-dd21-4b86-acf8-78d208f69482/mount\"}"
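
The failing check from the NodeExpandVolume error above can be reproduced by hand on the node where the volume is attached (device path taken from the log; both commands are read-only, and note the filesystem is still mounted during online expansion, so e2fsck output may be noisy):

dumpe2fs -h /dev/longhorn/pvc-60597c25-dd21-4b86-acf8-78d208f69482   # prints the superblock and reports the checksum mismatch
e2fsck -n /dev/longhorn/pvc-60597c25-dd21-4b86-acf8-78d208f69482     # read-only check; answers 'no' to all prompts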

@c3y1huang
Contributor

c3y1huang commented Jun 9, 2023

Finding

  • sles-15-sp-3: PASS
    • Kernel 5.3.18-59.37-default
  • sles-15-sp-4: FAIL
    • Kernel 5.14.21-150400.24.60-default

Cause
This seems to be a known bug introduced in kernel v5.11 by this, and later addressed in v6.2 here.

Resolution
Proposal 1: Update the kernel version so the pipeline can pass.

Proposal 2: Change the pipeline to run on sles-15-sp-3.

Proposal 3: Do nothing; this is a known issue that should self-resolve once a later AMI includes the upstream fix.
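
For reference, a quick way to check which kernel each node is running when weighing these proposals (per the finding above, the affected range starts at v5.11 and the fix landed in v6.2):

kubectl get nodes -o wide   # the KERNEL-VERSION column shows e.g. 5.14.21-150400.24.60-default
uname -r                    # run directly on the node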

@innobead innobead added investigation-needed Need to identify the case before estimating and starting the development priority/0 Must be fixed in this release (managed by PO) labels Jun 11, 2023
@innobead innobead added priority/1 Highly recommended to fix in this release (managed by PO) area/upstream Upstream related like tgt upstream library and removed priority/0 Must be fixed in this release (managed by PO) priority/1 Highly recommended to fix in this release (managed by PO) labels Jun 12, 2023
@innobead
Member

ref: #6089

@innobead
Member

As per discussion with @c3y1huang, this is a kernel issue and it can be fixed by upgrading the kernel to the latest patch for SLE 15.4.

The action is that we don't need to do anything like longhorn/longhorn-tests#1417, but rather recognize this as a known issue for fresh installs of SLE 15.4. After upgrading to 15.5, revisit this issue to see whether the test case can run without issues.

cc @longhorn/qa

@c3y1huang
Contributor

Test results here.

Closing as discussed and in favor of proposal 3.

@PhanLe1010
Contributor

PhanLe1010 commented Jun 14, 2023

@c3y1huang Great finding. Could you please share the debugging process that you went through to find such an upstream kernel bug? Thank you very much.

@derekbit
Member

@c3y1huang Great finding. Could you please share the debugging process that you went through to find such an upstream kernel bug? Thank you very much.

#6076 (comment)

cc @PhanLe1010
