
[BUG] Minor typo in a lhv yaml stops entire cluster from working #2423

Closed
ElisaMeng opened this issue Mar 30, 2021 · 14 comments
Assignees
Labels
area/api Longhorn manager public API component/longhorn-manager Longhorn manager (control plane) invalid kind/bug kind/refactoring Request for refactoring (code) severity/3 Function working but has a major issue w/ workaround
Milestone

Comments

@ElisaMeng

ElisaMeng commented Mar 30, 2021

Describe the bug
The longhorn manager failed to start up.

kubectl -n longhorn-system logs -f longhorn-manager-zsq7d
time="2021-03-30T09:45:27Z" level=info msg="Start overwriting built-in settings with customized values"
W0330 09:45:27.590921       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-03-30T09:45:27Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access '/var/lib/rancher/longhorn/engine-binaries/*': No such file or directory\n, error exit status 2"
I0330 09:45:27.617993       1 leaderelection.go:241] attempting to acquire leader lease  longhorn-system/longhorn-manager-upgrade-lock...
time="2021-03-30T09:45:27Z" level=info msg="New upgrade leader elected: master1"
time="2021-03-30T09:45:55Z" level=info msg="New upgrade leader elected: node7"
time="2021-03-30T09:46:10Z" level=info msg="New upgrade leader elected: node8"

To Reproduce
Steps to reproduce the behavior:

  1. Run 'kubectl -n longhorn-system rollout restart daemonset longhorn-manager'
  2. The longhorn-manager pods keep electing a leader forever


Environment:

  • Longhorn version: 1.1.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s v1.19.8
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:


@ElisaMeng
Author

ElisaMeng commented Mar 30, 2021

It finally passed the election, but crashed quickly with:

level=error msg="Upgrade failed: upgrade Pods failed: upgrade from v1.0.2 to v1.1.0: upgrade volume failed: upgrade from v1.0.2 to v1.1.0: failed to list all existing Longhorn volumes during the volume upgrade: v1beta1.VolumeList.Items: []v1beta1.Volume: v1beta1.Volume.Spec: types.VolumeSpec.RecurringJobs: []types.RecurringJob: types.RecurringJob.Labels: ReadMapCB: expect { or n, but found \", error found in #10 byte of ...|\"labels\":\"weekly\",\"n|..., bigger context ...|:3,\"recurringJobs\":[{\"cron\":\"0 1 * * 6\",\"labels\":\"weekly\",\"name\":\"backup\",\"retain\":3,\"task\":\"backup\"|..."
time="2021-03-30T10:15:08Z" level=info msg="Upgrade leader lost: master"

The engine image was deployed earlier, but since no volume uses it anymore, it got cleaned up and removed for some reason.

@ElisaMeng
Author

ElisaMeng commented Mar 30, 2021

Problem identified: there is a lhv with a recurring backup configured, and it seems there is a typo in it. This minor typo shut down the entire cluster! That is unacceptable for a distributed system, I would say.

@ElisaMeng ElisaMeng changed the title [BUG] longhorn manager failed to elect leader [BUG] Minor typo in a lhv yaml stops entire cluster from working Mar 30, 2021
@innobead innobead added kind/bug component/longhorn-manager Longhorn manager (control plane) labels Mar 30, 2021
@innobead innobead added this to New in Community Issue Review via automation Mar 30, 2021
@innobead
Member

Problem identified: there is a lhv with a recurring backup configured, and it seems there is a typo in it. This minor typo shut down the entire cluster! That is unacceptable for a distributed system, I would say.

Thanks for raising this issue.

Could you help update the reproduction steps? Also, please provide a support bundle to help quickly identify the cause. Thanks.

@ElisaMeng
Author

ElisaMeng commented Mar 30, 2021

Steps to reproduce:

  • Run 'kubectl -n longhorn-system edit lhv'
  • Put in some syntactically valid YAML that is not friendly to Longhorn
  • Add a new node to the cluster
  • Longhorn is no longer able to attach any lhv
  • Restart the longhorn-manager daemonset.
  • No manager can become ready; your cluster is gone. :(

You might need two engine images in the cluster, since it is this line that stops longhorn-manager from starting up:

level=error msg="Upgrade failed: upgrade Pods failed: upgrade from v1.0.2 to v1.1.0: upgrade volume failed: upgrade from v1.0.2 to v1.1.0: failed to list all existing Longhorn volumes during the volume upgrade: v1beta1.VolumeList.Items: []v1beta1.Volume: v1beta1.Volume.Spec: types.VolumeSpec.RecurringJobs: []types.RecurringJob: types.RecurringJob.Labels: ReadMapCB: expect { or n, but found \", error found in #10 byte of ...|\"labels\":\"weekly\",\"n|..., bigger context ...|:3,\"recurringJobs\":[{\"cron\":\"0 1 * * 6\",\"labels\":\"weekly\",\"name\":\"backup\",\"retain\":3,\"task\":\"backup\"|..."

@c3y1huang
Contributor

  • Put in some syntactically valid YAML that is not friendly to Longhorn

Can you share the YAML you used for recurringJobs?

@ElisaMeng
Author


apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  finalizers:
  - longhorn.io
  generation: 6
  labels:
    longhornvolume: pvc-220de6e6-014f-4675-8a0b-f9fde19bf187
    manager: longhorn-manager
  name: pvc-220de6e6-014f-4675-8a0b-f9fde19bf187
  namespace: longhorn-system
spec:
  Standby: false
  accessMode: rwo
  baseImage: ""
  dataLocality: disabled
  disableFrontend: false
  diskSelector: null
  engineImage: longhornio/longhorn-engine:v1.1.0
  fromBackup: ""
  frontend: blockdev
  lastAttachedBy: ""
  nodeID: node15
  nodeSelector: null
  numberOfReplicas: 2
  recurringJobs:
  - cron: 0 3 * * 6
    labels: null
    name: snap
    retain: 3
    task: snapshot
  - cron: 0 3 * * 6
    labels: weekly
    name: backup
    retain: 3
    task: backup
  revisionCounterDisabled: false
  size: "300003885056"
  staleReplicaTimeout: 2880

I did a 'kubectl edit lhv', and as far as I remember, it was something like the above.

@ElisaMeng
Author

@c3y1huang I don't think the main issue here is to reproduce it, but to review the overall architecture. An error in an individual volume should not prevent the entire cluster from working. What do you say @joshimoo @yasker ?

@PhanLe1010
Contributor

The error comes from this line which prevents the Longhorn manager pods from starting.

To prevent users from accidentally updating Longhorn CRs with invalid values and thus taking down the Longhorn system, we can use schema validation for the Longhorn CRDs:
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation
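Following the linked doc, an OpenAPI v3 structural schema on the CRD can reject such a value when the object is written, before any controller sees it. The fragment below is a hypothetical sketch (field names mirror the volume spec shown earlier; this is not Longhorn's actual CRD):

```yaml
# Hypothetical excerpt of a CRD validation schema for the Volume spec.
# Rejects recurringJobs entries whose "labels" is not a string-to-string map.
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        recurringJobs:
          type: array
          items:
            type: object
            properties:
              cron:
                type: string
              labels:
                type: object          # a bare string like "weekly" is rejected here
                additionalProperties:
                  type: string
              name:
                type: string
              retain:
                type: integer
              task:
                type: string
```

With such a schema in place, the bad edit fails at the API server instead of crashing every longhorn-manager pod later.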

@PhanLe1010 PhanLe1010 moved this from New to Backlog Candidates in Community Issue Review Apr 2, 2021
@jenting
Contributor

jenting commented Apr 2, 2021

The error comes from this line which prevents the Longhorn manager pods from starting.

To prevent users from accidentally updating Longhorn CRs with invalid values and thus taking down the Longhorn system, we can use schema validation for the Longhorn CRDs:
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation

Referring to #604, we need to improve the CRDs to have structural schemas and a ValidatingAdmissionWebhook.
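The check such a ValidatingAdmissionWebhook would run on the decoded volume spec can be sketched in a few lines of Go. This is a hypothetical helper (validateRecurringJobLabels is not Longhorn code), shown here only to illustrate the kind of guard a webhook adds:

```go
package main

import "fmt"

// validateRecurringJobLabels is a hypothetical check a validating admission
// webhook could run before a Volume CR update is persisted: it walks the
// decoded spec and rejects any recurringJobs entry whose "labels" value is
// not a map.
func validateRecurringJobLabels(spec map[string]interface{}) error {
	jobs, _ := spec["recurringJobs"].([]interface{})
	for _, j := range jobs {
		job, ok := j.(map[string]interface{})
		if !ok {
			continue
		}
		labels, present := job["labels"]
		if !present || labels == nil {
			continue // "labels: null" is valid, as in the volume YAML above
		}
		if _, ok := labels.(map[string]interface{}); !ok {
			return fmt.Errorf("recurringJob %v: labels must be a map, got %T", job["name"], labels)
		}
	}
	return nil
}

func main() {
	// The problematic edit: labels is the string "weekly".
	bad := map[string]interface{}{
		"recurringJobs": []interface{}{
			map[string]interface{}{"name": "backup", "labels": "weekly"},
		},
	}
	fmt.Println(validateRecurringJobLabels(bad)) // prints a rejection error

	good := map[string]interface{}{
		"recurringJobs": []interface{}{
			map[string]interface{}{"name": "backup", "labels": map[string]interface{}{"interval": "weekly"}},
		},
	}
	fmt.Println(validateRecurringJobLabels(good)) // prints <nil>
}
```

Rejecting the write at admission time keeps the invalid object out of etcd entirely, so a later upgrade or restart never has to decode it.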

@joshimoo
Contributor

joshimoo commented Apr 2, 2021

Just leaving a note here, we were planning to look at schema validation as part of the api refactor issue:
#791

@innobead innobead added this to the v1.1.2 milestone Apr 5, 2021
@innobead innobead moved this from Backlog Candidates to Resolved/Scheduled in Community Issue Review Apr 5, 2021
@innobead innobead added the severity/3 Function working but has a major issue w/ workaround label Apr 5, 2021
@innobead innobead added the kind/refactoring Request for refactoring (code) label Apr 28, 2021
@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 Aug 12, 2021
@innobead innobead added the area/api Longhorn manager public API label Oct 20, 2021
@jenting
Contributor

jenting commented Dec 21, 2021

We have addressed this issue with the CRD structural schema.
What is left is adding the validating admission webhook for the Volume CRs.

@jenting
Contributor

jenting commented Feb 25, 2022

We have addressed this issue with the CRD structural schema. What is left is adding the validating admission webhook for the Volume CRs.

#3562

@innobead innobead removed this from the v1.3.0 milestone Apr 8, 2022
@innobead innobead added this to the v1.4.0 milestone Apr 8, 2022
@innobead innobead modified the milestones: v1.4.0, v1.5.0 Nov 7, 2022
@innobead innobead modified the milestones: v1.5.0, v1.6.0 May 3, 2023
@mantissahz
Contributor

@innobead @longhorn/qa
After adding the validating admission webhook for the Volume/Longhorn resource CRs and deprecating the volume spec recurringJobs and storageClass recurringJobs fields, this issue should be solved.
WDYT?

@innobead innobead closed this as not planned Won't fix, can't repro, duplicate, stale Jul 12, 2023
@innobead
Member

Agreed. Let's see if any validation-related issues come up later.
