[BUG] Duplicated default instance managers prevent engine/replica from starting #3000
Comments
Are those volumes using XFS as the filesystem? cc @joshimoo
We did not specify the filesystem type in the StorageClass, so according to #2991 it should be ext4.
@timmy59100 Can you also describe the behavior on the cluster? Is the share-manager pod in the
The support bundle was sent over email.
@joshimoo No, the share-manager stays up with the following log constantly:
Okay, I just checked the support bundle. This is unrelated to RWX and is similar to this issue:
@timmy59100 Can you try to delete this instance manager
@PhanLe1010 The one volume got attached after deleting the instance manager. When I try to start the other workloads, they are still stuck attaching.
After waiting a bit, every volume got attached.
Can we have a new support bundle?
Sent over email.
Which volume is stuck in attaching?
How did you know which instance manager to kill?
Sorry, I missed the earlier message. Apparently, there are multiple instance managers with the same spec (nodeID and image) in the cluster, so Longhorn manager is confused and doesn't know which instance manager to pick to start the engine/replica: https://github.com/longhorn/longhorn-manager/blob/5da6ccfbba0c4ae441c6b5101ace541c42271f81/datastore/longhorn.go#L2435 That is why I told you to delete one of the extra instance managers among the ones that share a spec. I am looking into how multiple instance managers with the same spec can exist, but I don't know why yet. From the code's perspective, there shouldn't be multiple instance managers with the same spec: https://github.com/longhorn/longhorn-manager/blob/5da6ccfbba0c4ae441c6b5101ace541c42271f81/controller/node_controller.go#L851-L896
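The ambiguity described above can be sketched roughly as follows. The `InstanceManager` struct and function names here are illustrative simplifications, not Longhorn's actual CRD types or API:

```go
// Rough sketch, under simplified assumptions: the manager selects an
// instance manager for a node by matching node ID and image. When the
// default instance manager is duplicated, more than one candidate
// matches and the caller cannot tell which should host the engine/replica.
package main

import "fmt"

// InstanceManager is a stand-in for Longhorn's instance manager CR.
type InstanceManager struct {
	Name   string
	NodeID string
	Image  string
}

// candidatesForNode returns every instance manager whose spec matches
// the given node ID and image. A healthy cluster yields exactly one.
func candidatesForNode(ims []InstanceManager, nodeID, image string) []InstanceManager {
	var matches []InstanceManager
	for _, im := range ims {
		if im.NodeID == nodeID && im.Image == image {
			matches = append(matches, im)
		}
	}
	return matches
}

func main() {
	ims := []InstanceManager{
		{Name: "instance-manager-e-aaaa", NodeID: "node-1", Image: "im:v1.2.0"},
		{Name: "instance-manager-e-bbbb", NodeID: "node-1", Image: "im:v1.2.0"}, // duplicated spec
	}
	// Two candidates match the same (nodeID, image) spec: ambiguous.
	fmt.Println(len(candidatesForNode(ims, "node-1", "im:v1.2.0"))) // prints 2
}
```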
@timmy59100 Is there any abnormal event that you observed while you were upgrading Longhorn? For example, a node reboot, an ETCD health problem, etc.
@PhanLe1010 It wasn't just after the upgrade. I deleted all instance managers to try to resolve this while the cluster was in a healthy state, but apparently multiple instance managers with the same spec appeared again.
@timmy59100 Do you mean that if you delete all instance managers now, the problem of multiple instance managers with the same spec will happen again?
I am trying to reproduce the issue so we can find the root cause and fix it in the next release.
I don't know for sure if it would happen again, but wouldn't deleting all instance managers crash all workloads?
From the code flow, it looks to me that multiple default instance managers can only be created if this function runs concurrently in different goroutines/processes. The workqueue mechanism in client-go guarantees that no two goroutines process the same item simultaneously. This leads me to the theory that there may have been multiple longhorn-manager pods running on the same node at the time. Not 100% sure. Anyway, the proposed fix is to hash the node name + image name to create the name of the default instance manager. This makes it impossible to create multiple default instance managers on the same node, because they would have the same name and ETCD would reject the later request.
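The proposed fix could look something like the sketch below. The exact hash function, truncation length, and name format are my own assumptions for illustration, not necessarily the scheme merged into Longhorn:

```go
// Sketch of the proposed fix: derive the default instance manager's name
// deterministically from node name + image, so a concurrent second create
// attempt produces the same object name and the API server (backed by
// ETCD) rejects it as a duplicate. Name format here is an assumption.
package main

import (
	"crypto/sha256"
	"fmt"
)

// defaultInstanceManagerName returns a stable name for the default
// instance manager of a given type (e.g. "e" for engine, "r" for
// replica) on a node, using a truncated SHA-256 of node name + image.
func defaultInstanceManagerName(imType, nodeName, image string) string {
	sum := sha256.Sum256([]byte(nodeName + image))
	// Keep the suffix short: Kubernetes object names are length-limited,
	// so an 8-hex-digit digest prefix is enough to stay unique per spec.
	return fmt.Sprintf("instance-manager-%s-%x", imType, sum[:4])
}

func main() {
	a := defaultInstanceManagerName("e", "node-1", "longhornio/longhorn-instance-manager:v1")
	b := defaultInstanceManagerName("e", "node-1", "longhornio/longhorn-instance-manager:v1")
	// Same spec always yields the same name, so a duplicate create collides.
	fmt.Println(a == b) // prints true
}
```

With a random or suffix-counted name, two racing goroutines each create a distinct object; with a spec-derived name, the second create fails with an "already exists" error and the race becomes harmless.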
Pre Ready-For-Testing Checklist
Test steps:
Verified on master-head 20221212. The test steps:
deployment.yaml
apiVersion: v1
kind: Service
metadata:
name: mysql-dep-rwo
labels:
app: mysql-dep-rwo
spec:
ports:
- port: 3306
selector:
app: mysql-dep-rwo
clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mysql-dep-rwo-pvc
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 0.5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysql-dep-rwo
labels:
app: mysql-dep-rwo
spec:
selector:
matchLabels:
app: mysql-dep-rwo # has to match .spec.template.metadata.labels
strategy:
type: Recreate
template:
metadata:
labels:
app: mysql-dep-rwo
spec:
restartPolicy: Always
containers:
- image: mysql:5.6
name: mysql-dep-rwo
livenessProbe:
exec:
command:
- ls
- /var/lib/mysql/lost+found
initialDelaySeconds: 5
periodSeconds: 5
ports:
- containerPort: 3306
name: mysql-dep-rwo
volumeMounts:
- name: mysql-dep-rwo-volume
mountPath: /var/lib/mysql
env:
- name: MYSQL_ROOT_PASSWORD
value: "rancher"
volumes:
- name: mysql-dep-rwo-volume
persistentVolumeClaim:
claimName: mysql-dep-rwo-pvc
---
apiVersion: v1
kind: Service
metadata:
name: mysql-dep-rwx
labels:
app: mysql-dep-rwx
spec:
ports:
- port: 3306
selector:
app: mysql-dep-rwx
clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mysql-dep-rwx-pvc
spec:
accessModes:
- ReadWriteMany
storageClassName: longhorn
resources:
requests:
storage: 0.5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysql-dep-rwx
labels:
app: mysql-dep-rwx
spec:
selector:
matchLabels:
app: mysql-dep-rwx # has to match .spec.template.metadata.labels
strategy:
type: Recreate
template:
metadata:
labels:
app: mysql-dep-rwx
spec:
restartPolicy: Always
containers:
- image: mysql:5.6
name: mysql-dep-rwx
livenessProbe:
exec:
command:
- ls
- /var/lib/mysql/lost+found
initialDelaySeconds: 5
periodSeconds: 5
ports:
- containerPort: 3306
name: mysql-dep-rwx
volumeMounts:
- name: mysql-dep-rwx-volume
mountPath: /var/lib/mysql
env:
- name: MYSQL_ROOT_PASSWORD
value: "rancher"
volumes:
- name: mysql-dep-rwx-volume
persistentVolumeClaim:
claimName: mysql-dep-rwx-pvc
statefulset.yaml
apiVersion: v1
kind: Service
metadata:
name: nginx-state-rwo
labels:
app: nginx-state-rwo
spec:
ports:
- port: 80
name: web-state-rwo
selector:
app: nginx-state-rwo
type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web-state-rwo
spec:
selector:
matchLabels:
app: nginx-state-rwo # has to match .spec.template.metadata.labels
serviceName: "nginx-state-rwo"
replicas: 1 # by default is 1
template:
metadata:
labels:
app: nginx-state-rwo # has to match .spec.selector.matchLabels
spec:
restartPolicy: Always
terminationGracePeriodSeconds: 10
containers:
- name: nginx-state-rwo
image: k8s.gcr.io/nginx-slim:0.8
livenessProbe:
exec:
command:
- ls
- /usr/share/nginx/html/lost+found
initialDelaySeconds: 5
periodSeconds: 5
ports:
- containerPort: 80
name: web-state-rwo
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "longhorn"
resources:
requests:
storage: 0.5Gi
---
apiVersion: v1
kind: Service
metadata:
name: nginx-state-rwx
labels:
app: nginx-state-rwx
spec:
ports:
- port: 80
name: web-state-rwx
selector:
app: nginx-state-rwx
type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web-state-rwx
spec:
selector:
matchLabels:
app: nginx-state-rwx # has to match .spec.template.metadata.labels
serviceName: "nginx-state-rwx"
replicas: 1 # by default is 1
template:
metadata:
labels:
app: nginx-state-rwx # has to match .spec.selector.matchLabels
spec:
restartPolicy: Always
terminationGracePeriodSeconds: 10
containers:
- name: nginx-state-rwx
image: k8s.gcr.io/nginx-slim:0.8
livenessProbe:
exec:
command:
- ls
- /usr/share/nginx/html/lost+found
initialDelaySeconds: 5
periodSeconds: 5
ports:
- containerPort: 80
name: web-state-rwx
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteMany" ]
storageClassName: "longhorn"
resources:
requests:
storage: 0.5Gi
Result: Passed
Describe the bug
All existing RWX volumes are no longer attaching after upgrading to 1.2.
To Reproduce
Existing volumes are not attaching to redeployed pods, not even after scaling the workload to zero. Restarting Longhorn components and Longhorn nodes did not help.
Expected behavior
Volumes should attach.
Log
If applicable, add the Longhorn managers' log when the issue happens.
Sent the Longhorn support bundle over email.
AttachVolume.Attach failed for volume "pvc-93aad038-6dda-482f-a8f6-d237a0414561" : rpc error: code = DeadlineExceeded desc = volume pvc-93aad038-6dda-482f-a8f6-d237a0414561 failed to attach to node pax-p-95
Environment:
Additional context