
[BUG] Unable to list backups when backuptarget resource is picked up by a cordoned node #7619

Closed
derekbit opened this issue Jan 11, 2024 · 5 comments
Assignees
Labels
area/volume-backup-restore Volume backup restore backport/1.4.5 backport/1.5.4 kind/bug priority/0 Must be fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/manual-test-plan Require adding/updating manual test cases if they can't be automated require/qa-review-coverage Require QA to review coverage
Milestone

Comments

@derekbit
Member

derekbit commented Jan 11, 2024

Describe the bug

Unable to list backups when backuptarget resource is picked up by a cordoned node

To Reproduce

  1. Prepare a 3-node cluster
  2. Cordon node-2 and node-3
  3. Install Longhorn v1.5.3 or master-head
  4. If the backuptarget is picked up by a cordoned node, the following error is emitted when listing backups:
failed to get engine client proxy: instance-manager-595f55887f0bef4e44b1846739426fd8 instance manager is in stopped, not running state

Expected behavior

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.5.3 and master-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


@derekbit derekbit added kind/bug area/volume-backup-restore Volume backup restore require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Jan 11, 2024
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Jan 11, 2024
@innobead innobead added this to the v1.6.0 milestone Jan 11, 2024
@c3y1huang
Contributor

So far I am unable to reproduce this. Checking with @derekbit about the reproduction steps.

Below are the steps I used to attempt to reproduce:

  1. Cordoned all worker nodes.
> k cordon ip-10-0-2-24; k cordon ip-10-0-2-68; k cordon ip-10-0-2-181
> k get nodes
NAME            STATUS                     ROLES                  AGE    VERSION
ip-10-0-2-24    Ready,SchedulingDisabled   <none>                 7d2h   v1.27.1+k3s1
ip-10-0-2-68    Ready,SchedulingDisabled   <none>                 7d2h   v1.27.1+k3s1
ip-10-0-2-181   Ready,SchedulingDisabled   <none>                 7d2h   v1.27.1+k3s1
ip-10-0-1-145   Ready                      control-plane,master   7d2h   v1.27.1+k3s1
  2. Checked instance-managers.
> k -n longhorn-system get lhim
NAME                                                STATE     TYPE   NODE            AGE
instance-manager-d9ac85ac5afd4cda85235f84a47d8dfd   running   aio    ip-10-0-2-68    5m47s
instance-manager-824cd89daa26c0160438e0f69c4f5b3c   running   aio    ip-10-0-2-24    5m47s
instance-manager-188cf465a00e27f2d2497f6cdd5017ec   running   aio    ip-10-0-2-181   5m44s
  3. Checked backup.
> k -n longhorn-system get backup
NAME                      SNAPSHOTNAME                           SNAPSHOTSIZE   SNAPSHOTCREATEDAT      STATE       LASTSYNCEDAT
backup-8357898bca8d4ded   ec4dced4-961d-4d8d-a3f2-a3384fc2bd26   117440512      2023-09-15T00:53:03Z   Completed   2024-01-11T03:12:51Z
  4. Disabled backup target.
> k -n longhorn-system get backuptargets.longhorn.io 
NAME      URL   CREDENTIAL   LASTBACKUPAT   AVAILABLE   LASTSYNCEDAT
default                      5m0s           false       2024-01-11T03:22:23Z
  5. Checked backup.
> k -n longhorn-system get backup
No resources found in longhorn-system namespace.
  6. Enabled backup target.
> k -n longhorn-system get backuptargets.longhorn.io 
NAME      URL                            CREDENTIAL   LASTBACKUPAT   AVAILABLE   LASTSYNCEDAT
default   s3://c3y1-s3@ap-southeast-1/   aws-secret   5m0s           true        2024-01-11T03:27:59Z
  7. Checked backup.
> k -n longhorn-system get backup
NAME                      SNAPSHOTNAME                           SNAPSHOTSIZE   SNAPSHOTCREATEDAT      STATE       LASTSYNCEDAT
backup-8357898bca8d4ded   ec4dced4-961d-4d8d-a3f2-a3384fc2bd26   117440512      2023-09-15T00:53:03Z   Completed   2024-01-11T03:28:00Z

@derekbit
Member Author

Provided a cluster exhibiting the issue to @c3y1huang.

My steps are:

k cordon dereksu-aws-longhorn-pool1-caacdd4c-5fjh9 
k cordon dereksu-aws-longhorn-pool1-caacdd4c-67mbh
-------
(install LH v1.5.3)
helm install longhorn --namespace longhorn-system --create-namespace ./chart

k apply -f deploy/backupstores/nfs-backupstore.yaml

Update the backuptarget setting to nfs://longhorn-test-nfs-svc.default:/opt/backupstore
-------
List backups on UI
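The backuptarget setting update in the steps above can also be expressed declaratively; a sketch of the corresponding Longhorn Setting resource, assuming the longhorn.io/v1beta2 API and the default longhorn-system namespace (field names should be verified against the installed CRD):

```yaml
# Sketch: Longhorn stores the backup target URL in a Setting custom resource.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: backup-target
  namespace: longhorn-system
# NFS backupstore deployed by deploy/backupstores/nfs-backupstore.yaml
value: nfs://longhorn-test-nfs-svc.default:/opt/backupstore
```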

[screenshot: error when listing backups in the Longhorn UI]

@c3y1huang
Contributor

c3y1huang commented Jan 11, 2024

Cause:

The DaemonSet controller automatically adds a set of tolerations to DaemonSet Pods:

https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations
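Per the Kubernetes documentation linked above, these are the tolerations the DaemonSet controller adds automatically. The node.kubernetes.io/unschedulable toleration is what allows DaemonSet Pods to keep running on cordoned nodes, while Pods created without it cannot be scheduled there:

```yaml
# Tolerations automatically added to DaemonSet Pods by the DaemonSet controller.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/pid-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/unschedulable   # tolerates cordoned (SchedulingDisabled) nodes
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/network-unavailable   # added only for host-network Pods
  operator: Exists
  effect: NoSchedule
```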

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 12, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: issue description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at fix(backuptarget): cordoned node controller renders the backup target unusable longhorn-manager#2450

  • Which areas/issues this PR might have potential impacts on?
    Area backup, manager
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at test(manual): cordoned node controller renders the backup target unusable longhorn-tests#1671

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@c3y1huang c3y1huang added the require/manual-test-plan Require adding/updating manual test cases if they can't be automated label Jan 12, 2024
@chriscchien
Contributor

Verified pass on longhorn master (longhorn-manager a78c2a) with the test steps.

Reproducible on v1.5.x. In master-head, backups can be listed when Longhorn is installed on a cluster that already has some worker nodes cordoned.
