
[BUG] Minor typo in a lhv yaml stops entire cluster from working #2423

Closed
ElisaMeng opened this issue Mar 30, 2021 · 14 comments
Assignees
Labels
area/api Longhorn manager public API component/longhorn-manager Longhorn manager (control plane) invalid kind/bug kind/refactoring Request for refactoring (code) severity/3 Function working but has a major issue w/ workaround
Milestone

Comments

@ElisaMeng

ElisaMeng commented Mar 30, 2021

Describe the bug
The longhorn manager failed to start up.

kubectl -n longhorn-system logs -f longhorn-manager-zsq7d
time="2021-03-30T09:45:27Z" level=info msg="Start overwriting built-in settings with customized values"
W0330 09:45:27.590921       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-03-30T09:45:27Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access '/var/lib/rancher/longhorn/engine-binaries/*': No such file or directory\n, error exit status 2"
I0330 09:45:27.617993       1 leaderelection.go:241] attempting to acquire leader lease  longhorn-system/longhorn-manager-upgrade-lock...
time="2021-03-30T09:45:27Z" level=info msg="New upgrade leader elected: master1"
time="2021-03-30T09:45:55Z" level=info msg="New upgrade leader elected: node7"
time="2021-03-30T09:46:10Z" level=info msg="New upgrade leader elected: node8"

To Reproduce
Steps to reproduce the behavior:

  1. Run 'kubectl -n longhorn-system rollout restart daemonset longhorn-manager'
  2. The longhorn-manager pods keep electing a leader forever


Environment:

  • Longhorn version: 1.1.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s v1.19.8
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:


@ElisaMeng
Author

ElisaMeng commented Mar 30, 2021

It finally passed the election, but crashed quickly with:

level=error msg="Upgrade failed: upgrade Pods failed: upgrade from v1.0.2 to v1.1.0: upgrade volume failed: upgrade from v1.0.2 to v1.1.0: failed to list all existing Longhorn volumes during the volume upgrade: v1beta1.VolumeList.Items: []v1beta1.Volume: v1beta1.Volume.Spec: types.VolumeSpec.RecurringJobs: []types.RecurringJob: types.RecurringJob.Labels: ReadMapCB: expect { or n, but found \", error found in #10 byte of ...|\"labels\":\"weekly\",\"n|..., bigger context ...|:3,\"recurringJobs\":[{\"cron\":\"0 1 * * 6\",\"labels\":\"weekly\",\"name\":\"backup\",\"retain\":3,\"task\":\"backup\"|..."
time="2021-03-30T10:15:08Z" level=info msg="Upgrade leader lost: master"

The engine image was deployed earlier, but since no volume uses it anymore, it got cleaned up and removed for some reason.

@ElisaMeng
Author

ElisaMeng commented Mar 30, 2021

Problem identified: there is a lhv with a recurring backup configured, and it seems there is a typo in it. This minor typo shut down the entire cluster! That is unacceptable for a distributed system, I would say.

@ElisaMeng ElisaMeng changed the title [BUG] longhorn manager failed to elect leader [BUG] Minor typo in a lhv yaml stops entire cluster from working Mar 30, 2021
@innobead innobead added kind/bug component/longhorn-manager Longhorn manager (control plane) labels Mar 30, 2021
@innobead innobead added this to New in Community Issue Review via automation Mar 30, 2021
@innobead
Member

Problem identified: there is a lhv with a recurring backup configured, and it seems there is a typo in it. This minor typo shut down the entire cluster! That is unacceptable for a distributed system, I would say.

Thanks for raising this issue.

Could you help update the reproduction steps? Also, please provide a support bundle to help quickly identify the cause. Thanks.

@ElisaMeng
Author

ElisaMeng commented Mar 30, 2021

Steps to reproduce:

  • Run 'kubectl -n longhorn-system edit lhv'
  • Put in some syntactically valid YAML that is not friendly to Longhorn
  • Add a new node to the cluster
  • Longhorn is no longer able to attach any lhv
  • Restart the longhorn-manager daemonset.
  • No manager can become ready; your cluster is gone. :(

You might need two engine images in the cluster, since it is this line that stops longhorn-manager from starting up:

level=error msg="Upgrade failed: upgrade Pods failed: upgrade from v1.0.2 to v1.1.0: upgrade volume failed: upgrade from v1.0.2 to v1.1.0: failed to list all existing Longhorn volumes during the volume upgrade: v1beta1.VolumeList.Items: []v1beta1.Volume: v1beta1.Volume.Spec: types.VolumeSpec.RecurringJobs: []types.RecurringJob: types.RecurringJob.Labels: ReadMapCB: expect { or n, but found \", error found in #10 byte of ...|\"labels\":\"weekly\",\"n|..., bigger context ...|:3,\"recurringJobs\":[{\"cron\":\"0 1 * * 6\",\"labels\":\"weekly\",\"name\":\"backup\",\"retain\":3,\"task\":\"backup\"|..."

@c3y1huang
Contributor

  • Put in some syntactically valid YAML that is not friendly to Longhorn

Can you share the YAML you used for recurringJobs?

@ElisaMeng
Author


apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  finalizers:
  - longhorn.io
  generation: 6
  labels:
    longhornvolume: pvc-220de6e6-014f-4675-8a0b-f9fde19bf187
    manager: longhorn-manager
  name: pvc-220de6e6-014f-4675-8a0b-f9fde19bf187
  namespace: longhorn-system
spec:
  Standby: false
  accessMode: rwo
  baseImage: ""
  dataLocality: disabled
  disableFrontend: false
  diskSelector: null
  engineImage: longhornio/longhorn-engine:v1.1.0
  fromBackup: ""
  frontend: blockdev
  lastAttachedBy: ""
  nodeID: node15
  nodeSelector: null
  numberOfReplicas: 2
  recurringJobs:
  - cron: 0 3 * * 6
    labels: null
    name: snap
    retain: 3
    task: snapshot
  - cron: 0 3 * * 6
    labels: weekly
    name: backup
    retain: 3
    task: backup
  revisionCounterDisabled: false
  size: "300003885056"
  staleReplicaTimeout: 2880

I did a 'kubectl edit lhv', and as far as I remember, it was something like the above.

@ElisaMeng
Author

@c3y1huang I don't think the main issue here is to reproduce it, but to review the overall architecture. An error in an individual volume should not prevent the entire cluster from working. What do you say @joshimoo @yasker ?

@PhanLe1010
Contributor

The error comes from this line which prevents the Longhorn manager pods from starting.

To prevent users from accidentally updating Longhorn CRs with invalid values and thus taking down the Longhorn system, we can use schema validation for the Longhorn CRDs:
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation
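Following the linked doc, an OpenAPI v3 structural schema on the CRD can reject such a value when the object is written, before any controller sees it. The fragment below is a hypothetical sketch (field names mirror the volume spec shown earlier; this is not Longhorn's actual CRD):

```yaml
# Hypothetical excerpt of a CRD validation schema for the Volume spec.
# Rejects recurringJobs entries whose "labels" is not a string-to-string map.
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        recurringJobs:
          type: array
          items:
            type: object
            properties:
              cron:
                type: string
              labels:
                type: object          # a bare string like "weekly" is rejected here
                additionalProperties:
                  type: string
              name:
                type: string
              retain:
                type: integer
              task:
                type: string
```

With such a schema in place, the bad edit fails at the API server instead of crashing every longhorn-manager pod later.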

@PhanLe1010 PhanLe1010 moved this from New to Backlog Candidates in Community Issue Review Apr 2, 2021
@jenting
Contributor

jenting commented Apr 2, 2021

The error comes from this line which prevents the Longhorn manager pods from starting.

To prevent users from accidentally updating Longhorn CRs with invalid values and thus taking down the Longhorn system, we can use schema validation for the Longhorn CRDs:
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation

Referring to #604, we need to improve the CRDs to have structural schemas and a ValidatingAdmissionWebhook.
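The check such a ValidatingAdmissionWebhook would run on the decoded volume spec can be sketched in a few lines of Go. This is a hypothetical helper (validateRecurringJobLabels is not Longhorn code), shown here only to illustrate the kind of guard a webhook adds:

```go
package main

import "fmt"

// validateRecurringJobLabels is a hypothetical check a validating admission
// webhook could run before a Volume CR update is persisted: it walks the
// decoded spec and rejects any recurringJobs entry whose "labels" value is
// not a map.
func validateRecurringJobLabels(spec map[string]interface{}) error {
	jobs, _ := spec["recurringJobs"].([]interface{})
	for _, j := range jobs {
		job, ok := j.(map[string]interface{})
		if !ok {
			continue
		}
		labels, present := job["labels"]
		if !present || labels == nil {
			continue // "labels: null" is valid, as in the volume YAML above
		}
		if _, ok := labels.(map[string]interface{}); !ok {
			return fmt.Errorf("recurringJob %v: labels must be a map, got %T", job["name"], labels)
		}
	}
	return nil
}

func main() {
	// The problematic edit: labels is the string "weekly".
	bad := map[string]interface{}{
		"recurringJobs": []interface{}{
			map[string]interface{}{"name": "backup", "labels": "weekly"},
		},
	}
	fmt.Println(validateRecurringJobLabels(bad)) // prints a rejection error

	good := map[string]interface{}{
		"recurringJobs": []interface{}{
			map[string]interface{}{"name": "backup", "labels": map[string]interface{}{"interval": "weekly"}},
		},
	}
	fmt.Println(validateRecurringJobLabels(good)) // prints <nil>
}
```

Rejecting the write at admission time keeps the invalid object out of etcd entirely, so a later upgrade or restart never has to decode it.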

@joshimoo
Contributor

joshimoo commented Apr 2, 2021

Just leaving a note here, we were planning to look at schema validation as part of the api refactor issue:
#791

@innobead innobead added this to the v1.1.2 milestone Apr 5, 2021
@innobead innobead moved this from Backlog Candidates to Resolved/Scheduled in Community Issue Review Apr 5, 2021
@innobead innobead added the severity/3 Function working but has a major issue w/ workaround label Apr 5, 2021
@innobead innobead added the kind/refactoring Request for refactoring (code) label Apr 28, 2021
@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 Aug 12, 2021
@innobead innobead added the area/api Longhorn manager public API label Oct 20, 2021
@jenting
Contributor

jenting commented Dec 21, 2021

We have addressed this issue with the CRD structural schema.
What is left is adding the validating admission webhook for the Volume CRs.

@jenting
Contributor

jenting commented Feb 25, 2022

We have addressed this issue with the CRD structural schema. What is left is adding the validating admission webhook for the Volume CRs.

#3562

@innobead innobead removed this from the v1.3.0 milestone Apr 8, 2022
@innobead innobead added this to the v1.4.0 milestone Apr 8, 2022
@innobead innobead modified the milestones: v1.4.0, v1.5.0 Nov 7, 2022
@innobead innobead modified the milestones: v1.5.0, v1.6.0 May 3, 2023
@mantissahz
Contributor

@innobead @longhorn/qa
After adding the validating admission webhook for the Volume/Longhorn resource CRs and deprecating the volume spec recurringJobs and storageClass recurringJobs fields, this issue should be solved.
WDYT?

@innobead innobead closed this as not planned Won't fix, can't repro, duplicate, stale Jul 12, 2023
@innobead
Member

Agreed. Let's see if any validation-related issues come up later.
