
[BUG] Unable to list backups when backuptarget resource is picked up by a cordoned node #7619

Closed
derekbit opened this issue Jan 11, 2024 · 5 comments
Assignees
Labels
area/volume-backup-restore Volume backup restore backport/1.4.5 backport/1.5.4 kind/bug priority/0 Must be fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/manual-test-plan Require adding/updating manual test cases if they can't be automated require/qa-review-coverage Require QA to review coverage
Milestone

Comments

@derekbit
Member

derekbit commented Jan 11, 2024

Describe the bug

Unable to list backups when backuptarget resource is picked up by a cordoned node

To Reproduce

  1. Prepare a 3-node cluster
  2. Cordon node-2 and node-3
  3. Install Longhorn v1.5.3 or master-head
  4. If the backuptarget is picked up by a cordoned node, the following error is emitted when listing backups:
failed to get engine client proxy: instance-manager-595f55887f0bef4e44b1846739426fd8 instance manager is in stopped, not running state

Expected behavior

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.5.3 and master-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


@derekbit derekbit added kind/bug area/volume-backup-restore Volume backup restore require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Jan 11, 2024
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Jan 11, 2024
@innobead innobead added this to the v1.6.0 milestone Jan 11, 2024
@c3y1huang
Contributor

So far I am unable to reproduce this. Checking with @derekbit about the reproduction steps.

Below are the steps I used to attempt to reproduce:

  1. Cordoned all worker nodes.
> k cordon ip-10-0-2-24; k cordon ip-10-0-2-68; k cordon ip-10-0-2-181
> k get nodes
NAME            STATUS                     ROLES                  AGE    VERSION
ip-10-0-2-24    Ready,SchedulingDisabled   <none>                 7d2h   v1.27.1+k3s1
ip-10-0-2-68    Ready,SchedulingDisabled   <none>                 7d2h   v1.27.1+k3s1
ip-10-0-2-181   Ready,SchedulingDisabled   <none>                 7d2h   v1.27.1+k3s1
ip-10-0-1-145   Ready                      control-plane,master   7d2h   v1.27.1+k3s1
  2. Checked instance-managers.
> k -n longhorn-system get lhim
NAME                                                STATE     TYPE   NODE            AGE
instance-manager-d9ac85ac5afd4cda85235f84a47d8dfd   running   aio    ip-10-0-2-68    5m47s
instance-manager-824cd89daa26c0160438e0f69c4f5b3c   running   aio    ip-10-0-2-24    5m47s
instance-manager-188cf465a00e27f2d2497f6cdd5017ec   running   aio    ip-10-0-2-181   5m44s
  3. Checked backup.
> k -n longhorn-system get backup
NAME                      SNAPSHOTNAME                           SNAPSHOTSIZE   SNAPSHOTCREATEDAT      STATE       LASTSYNCEDAT
backup-8357898bca8d4ded   ec4dced4-961d-4d8d-a3f2-a3384fc2bd26   117440512      2023-09-15T00:53:03Z   Completed   2024-01-11T03:12:51Z
  4. Disabled backup target.
> k -n longhorn-system get backuptargets.longhorn.io 
NAME      URL   CREDENTIAL   LASTBACKUPAT   AVAILABLE   LASTSYNCEDAT
default                      5m0s           false       2024-01-11T03:22:23Z
  5. Checked backup.
> k -n longhorn-system get backup
No resources found in longhorn-system namespace.
  6. Enabled backup target.
> k -n longhorn-system get backuptargets.longhorn.io 
NAME      URL                            CREDENTIAL   LASTBACKUPAT   AVAILABLE   LASTSYNCEDAT
default   s3://c3y1-s3@ap-southeast-1/   aws-secret   5m0s           true        2024-01-11T03:27:59Z
  7. Checked backup.
> k -n longhorn-system get backup
NAME                      SNAPSHOTNAME                           SNAPSHOTSIZE   SNAPSHOTCREATEDAT      STATE       LASTSYNCEDAT
backup-8357898bca8d4ded   ec4dced4-961d-4d8d-a3f2-a3384fc2bd26   117440512      2023-09-15T00:53:03Z   Completed   2024-01-11T03:28:00Z

@derekbit
Member Author

Provided a cluster exhibiting the issue to @c3y1huang.

My steps are:

k cordon dereksu-aws-longhorn-pool1-caacdd4c-5fjh9 
k cordon dereksu-aws-longhorn-pool1-caacdd4c-67mbh
-------
(install LH v1.5.3)
helm install longhorn --namespace longhorn-system --create-namespace ./chart

k apply -f deploy/backupstores/nfs-backupstore.yaml

Update the backuptarget setting to nfs://longhorn-test-nfs-svc.default:/opt/backupstore
-------
List backups on UI
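The backuptarget setting update in the steps above can also be expressed declaratively; a sketch of the corresponding Longhorn Setting resource, assuming the longhorn.io/v1beta2 API and the default longhorn-system namespace (field names should be verified against the installed CRD):

```yaml
# Sketch: Longhorn stores the backup target URL in a Setting custom resource.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: backup-target
  namespace: longhorn-system
# NFS backupstore deployed by deploy/backupstores/nfs-backupstore.yaml
value: nfs://longhorn-test-nfs-svc.default:/opt/backupstore
```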

[screenshot: error when listing backups in the Longhorn UI]

@c3y1huang
Contributor

c3y1huang commented Jan 11, 2024

Cause:

The DaemonSet controller automatically adds a set of tolerations to DaemonSet Pods:

https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations
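Per the Kubernetes documentation linked above, these are the tolerations the DaemonSet controller adds automatically. The node.kubernetes.io/unschedulable toleration is what allows DaemonSet Pods to keep running on cordoned nodes, while Pods created without it cannot be scheduled there:

```yaml
# Tolerations automatically added to DaemonSet Pods by the DaemonSet controller.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/pid-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/unschedulable   # tolerates cordoned (SchedulingDisabled) nodes
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/network-unavailable   # added only for host-network Pods
  operator: Exists
  effect: NoSchedule
```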

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 12, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: issue description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at fix(backuptarget): cordoned node controller renders the backup target unusable longhorn-manager#2450

  • Which areas/issues this PR might have potential impacts on?
    Area backup, manager
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at test(manual): cordoned node controller renders the backup target unusable longhorn-tests#1671

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@c3y1huang c3y1huang added the require/manual-test-plan Require adding/updating manual test cases if they can't be automated label Jan 12, 2024
@chriscchien
Contributor

Verified pass on longhorn master (longhorn-manager a78c2a) with the test steps.

Reproducible on v1.5.x. In master-head, backups can be listed when Longhorn is installed on a cluster that already has some worker nodes cordoned.
