
[BUG] Duplicated default instance manager leads to engine/replica cannot be started #3000

Closed
timmy59100 opened this issue Sep 13, 2021 · 34 comments
Assignees
Labels
backport/1.2.7 backport/1.3.3 component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) component/longhorn-manager Longhorn manager (control plane) investigation-needed Need to identify the case before estimating and starting the development kind/bug priority/1 Highly recommended to fix in this release (managed by PO)
Milestone

Comments

@timmy59100

timmy59100 commented Sep 13, 2021

Describe the bug
All existing RWX volumes stopped attaching after upgrading to 1.2.

To Reproduce
Existing volumes are not attaching to redeployed pods, not even after scaling the workloads to zero. Restarting the Longhorn components and the Longhorn nodes did not help.

Expected behavior
Volumes should attach.

Log
Sent the Longhorn support bundle.

AttachVolume.Attach failed for volume "pvc-93aad038-6dda-482f-a8f6-d237a0414561" : rpc error: code = DeadlineExceeded desc = volume pvc-93aad038-6dda-482f-a8f6-d237a0414561 failed to attach to node pax-p-95

Environment:

  • Longhorn version: 1.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Rancher Kubernetes v1.20.10
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04.3 LTS
    • CPU per node: 12
    • Memory per node: 32G
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Xen
  • Number of Longhorn volumes in the cluster: 80


@yasker
Member

yasker commented Sep 13, 2021

Are those using xfs as the filesystem? cc @joshimoo

@timmy59100
Author

We did not specify the filesystem type in the storageclass, so according to #2991 it should be ext4

@joshimoo
Contributor

joshimoo commented Sep 13, 2021

@timmy59100 thanks for creating this issue. Can you send us a support bundle to longhorn-support-bundle@Suse.com, referencing this issue?

Can you also describe the behavior on the cluster: is the share-manager pod in the longhorn-system namespace getting continuously recreated?
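
One quick way to watch for that (a sketch; share-manager pods are typically named share-manager-<volume> in the longhorn-system namespace, but verify against your install):

# Watch the share-manager pods; the AGE/STATUS columns will keep churning
# if the pod is being continuously recreated.
kubectl get pods -n longhorn-system --watch | grep share-manager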

@joshimoo joshimoo added the area/volume-rwx Volume RWX related label Sep 13, 2021
@joshimoo joshimoo added this to New in Community Issue Review via automation Sep 13, 2021
@joshimoo joshimoo moved this from New to Pending user response in Community Issue Review Sep 13, 2021
@joshimoo joshimoo self-assigned this Sep 13, 2021
@PhanLe1010
Contributor

The support bundle was sent over email.

@timmy59100
Author

@joshimoo No, the share-manager stays up, constantly emitting the following log:

time="2021-09-13T20:52:15Z" level=warning msg="waiting with nfs server start, volume is not attached" encrypted=false volume=pvc-93aad038-6dda-482f-a8f6-d237a0414561

@joshimoo
Contributor

Okay, I just checked the support bundle; this is unrelated to RWX and is similar to this issue:
#2992

@joshimoo joshimoo added component/longhorn-manager Longhorn manager (control plane) component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) and removed area/volume-rwx Volume RWX related labels Sep 13, 2021
@joshimoo joshimoo changed the title [BUG] 1.2 all existing RWX volumes are not attaching anymore [BUG] 1.2 manager fails to find existing IM for engine Sep 13, 2021
@PhanLe1010
Contributor

@timmy59100 Can you try deleting this instance manager: kubectl delete instancemanager instance-manager-e-8e53eeae -n longhorn-system ?

@timmy59100
Author

@PhanLe1010 That one volume got attached after deleting the instancemanager. When I try to start the other workloads, they are still stuck attaching.

@timmy59100
Author

After waiting a bit, every volume got attached.

@PhanLe1010
Contributor

Can we have a new support bundle?

@timmy59100
Author

Sent over email.

@PhanLe1010
Contributor

Which volume is stuck in attaching?

@timmy59100
Author

timmy59100 commented Sep 13, 2021

After waiting a bit, every volume got attached.

How did you know which instancemanager to kill?

@PhanLe1010
Contributor

Sorry, I missed the earlier message.

Apparently, it is because there are multiple instance managers with the same spec (nodeID and image) in the cluster, so the Longhorn manager is confused and doesn't know which instance manager to pick to start the engine/replica: https://github.com/longhorn/longhorn-manager/blob/5da6ccfbba0c4ae441c6b5101ace541c42271f81/datastore/longhorn.go#L2435

I suggested deleting one of the extra instance managers among the ones that have the same spec.

I am looking into how it is possible to have multiple instance managers with the same spec, but I don't know why yet. From the code's perspective, there shouldn't be multiple instance managers with the same spec: https://github.com/longhorn/longhorn-manager/blob/5da6ccfbba0c4ae441c6b5101ace541c42271f81/controller/node_controller.go#L851-L896
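
For reference, a quick way to surface such duplicates (a sketch; the spec.nodeID, spec.image, and spec.type field names are taken from the InstanceManager CRD used by the linked code):

# List instance managers with their node, image, and type; any two rows that
# share NODE, IMAGE, and TYPE are duplicates, and the extra one can be removed
# with: kubectl delete instancemanager <name> -n longhorn-system
kubectl get instancemanagers -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,IMAGE:.spec.image,TYPE:.spec.type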

@PhanLe1010 PhanLe1010 added the investigation-needed Need to identify the case before estimating and starting the development label Sep 13, 2021
@PhanLe1010 PhanLe1010 moved this from Pending user response to Team review required in Community Issue Review Sep 13, 2021
@PhanLe1010
Contributor

@timmy59100 Did you observe any abnormal events while you were upgrading Longhorn? For example, a node reboot, an etcd health problem, etc.

@timmy59100
Author

@PhanLe1010 It wasn't just after the upgrade. I deleted all instance managers to try to resolve this while the cluster was in a healthy state, but apparently multiple instance managers with the same spec appeared again.

@PhanLe1010
Contributor

@timmy59100 Do you mean that if you delete all instance managers now, the problem of multiple instance managers with the same spec will happen again?

@PhanLe1010
Contributor

I am trying to reproduce the issue so we can find the root cause and fix it in the next release.

@timmy59100
Author

I don't know for sure if it would happen, but wouldn't deleting all instance managers crash all workloads?

@innobead innobead modified the milestones: Backlog, v1.4.0 Mar 25, 2022
@PhanLe1010 PhanLe1010 changed the title [BUG] 1.2 manager fails to find existing IM for engine [BUG] 1.2 Duplicated default instance manager leads to engine/replica cannot be started Mar 25, 2022
@innobead innobead added priority/0 Must be fixed in this release (managed by PO) and removed duplicated labels Jul 11, 2022
@innobead innobead added priority/1 Highly recommended to fix in this release (managed by PO) and removed priority/0 Must be fixed in this release (managed by PO) labels Nov 8, 2022
@innobead innobead assigned mantissahz and c3y1huang and unassigned c3y1huang and mantissahz Nov 30, 2022
@innobead innobead changed the title [BUG] 1.2 Duplicated default instance manager leads to engine/replica cannot be started [BUG] Duplicated default instance manager leads to engine/replica cannot be started Dec 7, 2022
@innobead innobead assigned PhanLe1010 and unassigned mantissahz Dec 9, 2022
@PhanLe1010
Contributor

PhanLe1010 commented Dec 10, 2022

From the code flow, it looks to me that multiple default instance managers can only be created if this function is run concurrently by different goroutines/processes. The workqueue mechanism in client-go guarantees that no two goroutines process the same item simultaneously. This leads me to the theory that maybe there were multiple Longhorn manager pods running on the same node at the time. Not 100% sure.

Anyway, the proposed fix is to hash the node name + image name to derive the name for the default instance manager. This makes it impossible to create multiple default instance managers on the same node: they would all end up with the same name, and ETCD would reject the later create request.
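
A minimal sketch of the naming idea (illustration only, not the exact longhorn-manager code; the instance-manager-e- prefix and the 8-character suffix length are assumptions):

# Derive the instance manager name deterministically from node + image. Two
# concurrent creates then collide on the same name, and the API server/ETCD
# rejects the second one with AlreadyExists instead of allowing a duplicate.
node="node-1"
image="longhornio/longhorn-instance-manager:v1_20221003"
suffix=$(printf '%s' "${node}-${image}" | sha256sum | cut -c1-8)
echo "instance-manager-e-${suffix}"   # same inputs always produce the same name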

@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 10, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Duplicated default instance manager leads to engine/replica cannot be started #3000 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is: delete the duplicated default instance manager.

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at Prevent creating multiple default instance managers on same node longhorn-manager#1599

  • Which areas/issues this PR might have potential impacts on?
    Area upgrade
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@PhanLe1010
Contributor

Test steps:
Focus on upgrade testing

  1. Install an older Longhorn version (1.2.x or 1.3.x)
  2. Create some volumes. Attach some. Create some recurring jobs
  3. Upgrade to the master-head version
  4. Verify that there are no duplicate default instance manager pods on the same node (i.e., no two pods with the same default instance manager image and instance manager type on the same node); see the sketch after this list
  5. Verify that detach/attach of volumes works fine
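
One way to check step 4 (a sketch; the longhorn.io/component=instance-manager label key is assumed from a typical Longhorn install, so verify the labels on your cluster first):

# List instance manager pods together with the node they run on; no node
# should appear twice for the same instance manager type and image.
kubectl get pods -n longhorn-system -l longhorn.io/component=instance-manager \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | sort -k2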

@roger-ryao

Verified on master-head 20221212

  • longhorn master-head (f30875a)
  • longhorn-manager master-head (81f77c6)

The test steps

Ref #3000 (To Reproduce)

  1. Create 4 volumes (2 RWO and 2 RWX)
  2. Create deployment

kubectl apply -f deployment.yaml

deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql-dep-rwo
  labels:
    app: mysql-dep-rwo
spec:
  ports:
    - port: 3306
  selector:
    app: mysql-dep-rwo
  clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-dep-rwo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 0.5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-dep-rwo
  labels:
    app: mysql-dep-rwo
spec:
  selector:
    matchLabels:
      app: mysql-dep-rwo # has to match .spec.template.metadata.labels
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql-dep-rwo
    spec:
      restartPolicy: Always
      containers:
      - image: mysql:5.6
        name: mysql-dep-rwo
        livenessProbe:
          exec:
            command:
              - ls
              - /var/lib/mysql/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 3306
          name: mysql-dep-rwo
        volumeMounts:
        - name: mysql-dep-rwo-volume
          mountPath: /var/lib/mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "rancher"
      volumes:
      - name: mysql-dep-rwo-volume
        persistentVolumeClaim:
          claimName: mysql-dep-rwo-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-dep-rwx
  labels:
    app: mysql-dep-rwx
spec:
  ports:
    - port: 3306
  selector:
    app: mysql-dep-rwx
  clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-dep-rwx-pvc
spec:
  accessModes:
    - ReadWriteMany    
  storageClassName: longhorn
  resources:
    requests:
      storage: 0.5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-dep-rwx
  labels:
    app: mysql-dep-rwx
spec:
  selector:
    matchLabels:
      app: mysql-dep-rwx # has to match .spec.template.metadata.labels
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql-dep-rwx
    spec:
      restartPolicy: Always
      containers:
      - image: mysql:5.6
        name: mysql-dep-rwx
        livenessProbe:
          exec:
            command:
              - ls
              - /var/lib/mysql/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 3306
          name: mysql-dep-rwx
        volumeMounts:
        - name: mysql-dep-rwx-volume
          mountPath: /var/lib/mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "rancher"
      volumes:
      - name: mysql-dep-rwx-volume
        persistentVolumeClaim:
          claimName: mysql-dep-rwx-pvc
  3. Create statefulset

kubectl apply -f statefulset.yaml

statefulset.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwo
  labels:
    app: nginx-state-rwo
spec:
  ports:
  - port: 80
    name: web-state-rwo
  selector:
    app: nginx-state-rwo
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwo
spec:
  selector:
    matchLabels:
      app: nginx-state-rwo # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwo"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwo # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwo
        image: k8s.gcr.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwo
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwx
  labels:
    app: nginx-state-rwx
spec:
  ports:
  - port: 80
    name: web-state-rwx
  selector:
    app: nginx-state-rwx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwx
spec:
  selector:
    matchLabels:
      app: nginx-state-rwx # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwx"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwx
        image: k8s.gcr.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteMany" ]      
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
  4. Create snapshot recurring jobs (see the sketch below)
  5. Upgrade to master-head version
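
For step 4, a minimal snapshot RecurringJob sketch (the name, cron schedule, retain count, and group are example values; the RecurringJob CRD is available since Longhorn v1.2):

# Create an hourly snapshot job for all volumes in the "default" group.
kubectl apply -f - <<'EOF'
apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: snapshot-hourly
  namespace: longhorn-system
spec:
  name: snapshot-hourly
  cron: "0 * * * *"
  task: snapshot
  groups: ["default"]
  retain: 3
  concurrency: 1
EOF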

Result: Passed

  1. Detach/attach of volumes works fine
  2. There are no duplicate default instance manager pods on the same node
