
[BUG] Duplicated default instance manager leads to engine/replica cannot be started #3000

Closed
timmy59100 opened this issue Sep 13, 2021 · 34 comments
Assignees
Labels
backport/1.2.7 backport/1.3.3 component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) component/longhorn-manager Longhorn manager (control plane) investigation-needed Need to identify the case before estimating and starting the development kind/bug priority/1 Highly recommended to fix in this release (managed by PO)
Milestone

Comments

@timmy59100

timmy59100 commented Sep 13, 2021

Describe the bug
All existing RWX volumes stopped attaching after upgrading to 1.2.

To Reproduce
Existing volumes are not attaching to redeployed pods, not even after scaling the workloads to zero. Restarting the Longhorn components and the Longhorn nodes did not help.

Expected behavior
Volumes should attach.

Log
Sent the Longhorn support bundle.

AttachVolume.Attach failed for volume "pvc-93aad038-6dda-482f-a8f6-d237a0414561" : rpc error: code = DeadlineExceeded desc = volume pvc-93aad038-6dda-482f-a8f6-d237a0414561 failed to attach to node pax-p-95

Environment:

  • Longhorn version: 1.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Rancher Kubernetes v1.20.10
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04.3 LTS
    • CPU per node: 12
    • Memory per node: 32G
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Xen
  • Number of Longhorn volumes in the cluster: 80


@yasker
Member

yasker commented Sep 13, 2021

Are those using xfs as the filesystem? cc @joshimoo

@timmy59100
Author

We did not specify the filesystem type in the storageclass, so according to #2991 it should be ext4

@joshimoo
Contributor

joshimoo commented Sep 13, 2021

@timmy59100 thanks for creating this issue. Can you send us a support bundle to longhorn-support-bundle@Suse.com, referencing this issue?

Can you also describe the behavior on the cluster: is the share-manager pod in the longhorn-system namespace getting continuously recreated?
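
One quick way to watch for that (a sketch; share-manager pods are typically named share-manager-<volume> in the longhorn-system namespace, but verify against your install):

# Watch the share-manager pods; the AGE/STATUS columns will keep churning
# if the pod is being continuously recreated.
kubectl get pods -n longhorn-system --watch | grep share-manager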

@joshimoo joshimoo added the area/volume-rwx Volume RWX related label Sep 13, 2021
@joshimoo joshimoo added this to New in Community Issue Review via automation Sep 13, 2021
@joshimoo joshimoo moved this from New to Pending user response in Community Issue Review Sep 13, 2021
@joshimoo joshimoo self-assigned this Sep 13, 2021
@PhanLe1010
Contributor

The support bundle was sent over email.

@timmy59100
Author

@joshimoo No, the share-manager stays up, constantly emitting the following log:

time="2021-09-13T20:52:15Z" level=warning msg="waiting with nfs server start, volume is not attached" encrypted=false volume=pvc-93aad038-6dda-482f-a8f6-d237a0414561

@joshimoo
Contributor

Okay, I just checked the support bundle; this is unrelated to RWX and is similar to this issue:
#2992

@joshimoo joshimoo added component/longhorn-manager Longhorn manager (control plane) component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) and removed area/volume-rwx Volume RWX related labels Sep 13, 2021
@joshimoo joshimoo changed the title [BUG] 1.2 all existing RWX volumes are not attaching anymore [BUG] 1.2 manager fails to find existing IM for engine Sep 13, 2021
@PhanLe1010
Contributor

@timmy59100 Can you try deleting this instance manager: kubectl delete instancemanager instance-manager-e-8e53eeae -n longhorn-system ?

@timmy59100
Author

@PhanLe1010 That one volume got attached after deleting the instancemanager. When I try to start the other workloads, they are still stuck attaching.

@timmy59100
Author

After waiting a bit, every volume got attached.

@PhanLe1010
Contributor

Can we have a new support bundle?

@timmy59100
Author

Sent over email.

@PhanLe1010
Contributor

Which volume is stuck in attaching?

@timmy59100
Author

timmy59100 commented Sep 13, 2021

After waiting a bit, every volume got attached.

How did you know which instancemanager to kill?

@PhanLe1010
Contributor

Sorry, I missed the earlier message.

Apparently, it is because there are multiple instance managers with the same spec (nodeID and image) in the cluster, so the Longhorn manager is confused and doesn't know which instance manager to pick to start the engine/replica: https://github.com/longhorn/longhorn-manager/blob/5da6ccfbba0c4ae441c6b5101ace541c42271f81/datastore/longhorn.go#L2435

I suggested deleting one of the extra instance managers among the ones that have the same spec.

I am looking into how it is possible to have multiple instance managers with the same spec, but I don't know why yet. From the code's perspective, there shouldn't be multiple instance managers with the same spec: https://github.com/longhorn/longhorn-manager/blob/5da6ccfbba0c4ae441c6b5101ace541c42271f81/controller/node_controller.go#L851-L896
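
For reference, a quick way to surface such duplicates (a sketch; the spec.nodeID, spec.image, and spec.type field names are taken from the InstanceManager CRD used by the linked code):

# List instance managers with their node, image, and type; any two rows that
# share NODE, IMAGE, and TYPE are duplicates, and the extra one can be removed
# with: kubectl delete instancemanager <name> -n longhorn-system
kubectl get instancemanagers -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,IMAGE:.spec.image,TYPE:.spec.type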

@PhanLe1010 PhanLe1010 added the investigation-needed Need to identify the case before estimating and starting the development label Sep 13, 2021
@PhanLe1010 PhanLe1010 moved this from Pending user response to Team review required in Community Issue Review Sep 13, 2021
@PhanLe1010
Contributor

@timmy59100 Did you observe any abnormal events while you were upgrading Longhorn? For example, a node reboot, an etcd health problem, etc.

@timmy59100
Author

@PhanLe1010 It wasn't just after the upgrade. I deleted all instance managers to try to resolve this while the cluster was in a healthy state, but apparently multiple instance managers with the same spec appeared again.

@PhanLe1010
Contributor

@timmy59100 Do you mean that if you delete all instance managers now, the problem of multiple instance managers with the same spec will happen again?

@PhanLe1010
Contributor

I am trying to reproduce the issue so we can find the root cause and fix it in the next release.

@timmy59100
Author

I don't know for sure if it would happen, but wouldn't deleting all instance managers crash all workloads?

@innobead innobead modified the milestones: Backlog, v1.4.0 Mar 25, 2022
@PhanLe1010 PhanLe1010 changed the title [BUG] 1.2 manager fails to find existing IM for engine [BUG] 1.2 Duplicated default instance manager leads to engine/replica cannot be started Mar 25, 2022
@innobead innobead added priority/0 Must be fixed in this release (managed by PO) and removed duplicated labels Jul 11, 2022
@innobead innobead added priority/1 Highly recommended to fix in this release (managed by PO) and removed priority/0 Must be fixed in this release (managed by PO) labels Nov 8, 2022
@innobead innobead assigned mantissahz and c3y1huang and unassigned c3y1huang and mantissahz Nov 30, 2022
@innobead innobead changed the title [BUG] 1.2 Duplicated default instance manager leads to engine/replica cannot be started [BUG] Duplicated default instance manager leads to engine/replica cannot be started Dec 7, 2022
@innobead innobead assigned PhanLe1010 and unassigned mantissahz Dec 9, 2022
@PhanLe1010
Contributor

PhanLe1010 commented Dec 10, 2022

From the code flow, it looks to me that multiple default instance managers can only be created if this function is run concurrently by different goroutines/processes. The workqueue mechanism in client-go guarantees that no two goroutines process the same item simultaneously. This leads me to the theory that maybe there were multiple Longhorn manager pods running on the same node at the time. Not 100% sure.

Anyway, the proposed fix is to hash the node name + image name to derive the name for the default instance manager. This makes it impossible to create multiple default instance managers on the same node: they would all end up with the same name, and ETCD would reject the later create request.
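
A minimal sketch of the naming idea (illustration only, not the exact longhorn-manager code; the instance-manager-e- prefix and the 8-character suffix length are assumptions):

# Derive the instance manager name deterministically from node + image. Two
# concurrent creates then collide on the same name, and the API server/ETCD
# rejects the second one with AlreadyExists instead of allowing a duplicate.
node="node-1"
image="longhornio/longhorn-instance-manager:v1_20221003"
suffix=$(printf '%s' "${node}-${image}" | sha256sum | cut -c1-8)
echo "instance-manager-e-${suffix}"   # same inputs always produce the same name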

@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 10, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Duplicated default instance manager leads to engine/replica cannot be started #3000 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is: delete the duplicated default instance manager.

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at Prevent creating multiple default instance managers on same node longhorn-manager#1599

  • Which areas/issues this PR might have potential impacts on?
    Area upgrade
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@PhanLe1010
Contributor

Test steps:
Focus on upgrade testing

  1. Install an older Longhorn version (1.2.x or 1.3.x)
  2. Create some volumes. Attach some. Create some recurring jobs
  3. Upgrade to the master-head version
  4. Verify that there are no duplicate default instance manager pods on the same node (i.e., no two pods with the same default instance manager image and instance manager type on the same node); see the sketch after this list
  5. Verify that detach/attach of volumes works fine
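
One way to check step 4 (a sketch; the longhorn.io/component=instance-manager label key is assumed from a typical Longhorn install, so verify the labels on your cluster first):

# List instance manager pods together with the node they run on; no node
# should appear twice for the same instance manager type and image.
kubectl get pods -n longhorn-system -l longhorn.io/component=instance-manager \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | sort -k2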

@roger-ryao

Verified on master-head 20221212

  • longhorn master-head (f30875a)
  • longhorn-manager master-head (81f77c6)

The test steps

Ref #3000 (To Reproduce)

  1. Create 4 volumes (2 RWO and 2 RWX)
  2. Create deployment

kubectl apply -f deployment.yaml

deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql-dep-rwo
  labels:
    app: mysql-dep-rwo
spec:
  ports:
    - port: 3306
  selector:
    app: mysql-dep-rwo
  clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-dep-rwo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 0.5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-dep-rwo
  labels:
    app: mysql-dep-rwo
spec:
  selector:
    matchLabels:
      app: mysql-dep-rwo # has to match .spec.template.metadata.labels
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql-dep-rwo
    spec:
      restartPolicy: Always
      containers:
      - image: mysql:5.6
        name: mysql-dep-rwo
        livenessProbe:
          exec:
            command:
              - ls
              - /var/lib/mysql/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 3306
          name: mysql-dep-rwo
        volumeMounts:
        - name: mysql-dep-rwo-volume
          mountPath: /var/lib/mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "rancher"
      volumes:
      - name: mysql-dep-rwo-volume
        persistentVolumeClaim:
          claimName: mysql-dep-rwo-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-dep-rwx
  labels:
    app: mysql-dep-rwx
spec:
  ports:
    - port: 3306
  selector:
    app: mysql-dep-rwx
  clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-dep-rwx-pvc
spec:
  accessModes:
    - ReadWriteMany    
  storageClassName: longhorn
  resources:
    requests:
      storage: 0.5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-dep-rwx
  labels:
    app: mysql-dep-rwx
spec:
  selector:
    matchLabels:
      app: mysql-dep-rwx # has to match .spec.template.metadata.labels
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql-dep-rwx
    spec:
      restartPolicy: Always
      containers:
      - image: mysql:5.6
        name: mysql-dep-rwx
        livenessProbe:
          exec:
            command:
              - ls
              - /var/lib/mysql/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 3306
          name: mysql-dep-rwx
        volumeMounts:
        - name: mysql-dep-rwx-volume
          mountPath: /var/lib/mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "rancher"
      volumes:
      - name: mysql-dep-rwx-volume
        persistentVolumeClaim:
          claimName: mysql-dep-rwx-pvc
  3. Create statefulset

kubectl apply -f statefulset.yaml

statefulset.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwo
  labels:
    app: nginx-state-rwo
spec:
  ports:
  - port: 80
    name: web-state-rwo
  selector:
    app: nginx-state-rwo
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwo
spec:
  selector:
    matchLabels:
      app: nginx-state-rwo # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwo"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwo # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwo
        image: k8s.gcr.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwo
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwx
  labels:
    app: nginx-state-rwx
spec:
  ports:
  - port: 80
    name: web-state-rwx
  selector:
    app: nginx-state-rwx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwx
spec:
  selector:
    matchLabels:
      app: nginx-state-rwx # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwx"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwx
        image: k8s.gcr.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteMany" ]      
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
  4. Create snapshot recurring jobs (see the sketch below)
  5. Upgrade to master-head version
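
For step 4, a minimal snapshot RecurringJob sketch (the name, cron schedule, retain count, and group are example values; the RecurringJob CRD is available since Longhorn v1.2):

# Create an hourly snapshot job for all volumes in the "default" group.
kubectl apply -f - <<'EOF'
apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: snapshot-hourly
  namespace: longhorn-system
spec:
  name: snapshot-hourly
  cron: "0 * * * *"
  task: snapshot
  groups: ["default"]
  retain: 3
  concurrency: 1
EOF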

Result: Passed

  1. Detach/attach of volumes works fine
  2. There are no duplicate default instance manager pods on the same node
