[BUG] Volume attach/detach/delete operations stuck in version 1.6.0 #7915

Open
Eilyre opened this issue Feb 12, 2024 · 23 comments
Labels
investigation-needed (Need to identify the case before estimating and starting the development) · kind/bug · require/backport (Require backport. Only used when the specific versions to backport have not been defined.) · require/qa-review-coverage (Require QA to review coverage)

Comments

Eilyre commented Feb 12, 2024

Describe the bug

Over the weekend, our production Longhorn cluster degraded in such a way that new volume attach/detach/delete operations get stuck and never finish. Workloads continue working until a volume needs to be detached and reattached on another node.

This happened after fixing the cluster with the solution proposed in #7887, though not immediately: I was still able to recreate resources and run operations on the cluster on Thursday and Friday.

To Reproduce

Unable to reproduce this in any of the test clusters.

Expected behavior

Volume attach/detach/delete operations should finish.

Support bundle for troubleshooting

Support ticket was sent to longhorn-support-bundle@suse.com.

Environment

  • Longhorn version: 1.6.0
  • Impacted volume (PV): All of them.
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Kubeadm, 1.27.4
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 1
  • Node config
    • OS type and version: Flatcar, latest
    • Kernel version: 6.1.73
    • CPU per node: 128 cores
    • Memory per node: 512 GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe
    • Network bandwidth between the nodes (Gbps): 100
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: ~160

Additional context

This only happens in the production cluster.

The volumes are stuck in a very weird state. The replica count is set to 3 for all volumes, but Longhorn keeps 4 replicas alive.

(screenshot attached: a volume showing 4 replicas despite the replica count of 3)

Eilyre added the kind/bug, require/backport, and require/qa-review-coverage labels Feb 12, 2024
PhanLe1010 commented:

@Eilyre Could you point me to one of the volumes that is currently stuck in attach/detach?

Eilyre commented Feb 12, 2024

@PhanLe1010

Stuck in detach: pvc-411dd345-c354-4f3e-ac9d-738b39671776, pvc-58297811-5dc4-4ac3-944d-056512745d8d, pvc-0762851a-d674-4854-a210-b1d5787c86f7, pvc-de096a05-870a-4a79-9e29-9f2c67cec105, pvc-e34dd802-84f5-4886-87ec-9401aaa0075a, pvc-ecce39fb-7606-40ee-a58a-7cd6a15965c0
Stuck in deleting state: pvc-a85b1d1f-3717-4f02-8754-c65fbcc2d981, pvc-a3a3b119-3159-4a60-916b-a4d0e039772c, pvc-ade4f6da-5a3f-47e6-b8f2-63f28e37f188

The ones in bold are the ones which I noticed first.
The deleting ones do not show any replicas through the UI.

This information comes from the UI. Currently there are no volumes stuck attaching, but it happens whenever I create a new deployment.

Eilyre commented Feb 12, 2024

Also, some extra information @PhanLe1010: replica rebuilds do not get scheduled. If I delete a replica of a healthy volume, it just stays degraded.

ejweber commented Feb 12, 2024

  • pvc-411dd345-c354-4f3e-ac9d-738b39671776
    • Detaching.
    • Waiting for pvc-411dd345-c354-4f3e-ac9d-738b39671776-e-592cb9ff to stop.
    • pvc-411dd345-c354-4f3e-ac9d-738b39671776-e-592cb9ff is associated with instance-manager-d7653d361fe2dcc420474ccf47737d4b.
  • pvc-58297811-5dc4-4ac3-944d-056512745d8d
    • Detaching.
    • Waiting for pvc-58297811-5dc4-4ac3-944d-056512745d8d-r-1488fa01 to stop.
    • pvc-58297811-5dc4-4ac3-944d-056512745d8d-r-1488fa01 is associated with instance-manager-d7653d361fe2dcc420474ccf47737d4b.
  • pvc-0762851a-d674-4854-a210-b1d5787c86f7
    • Detaching.
    • Waiting for pvc-0762851a-d674-4854-a210-b1d5787c86f7-e-acbd57c1 to stop.
    • pvc-0762851a-d674-4854-a210-b1d5787c86f7-e-acbd57c1 is associated with instance-manager-d7653d361fe2dcc420474ccf47737d4b.
  • pvc-de096a05-870a-4a79-9e29-9f2c67cec105
    • Detaching.
    • Waiting for pvc-de096a05-870a-4a79-9e29-9f2c67cec105-r-9983c992 to stop.
    • pvc-de096a05-870a-4a79-9e29-9f2c67cec105-r-9983c992 is associated with instance-manager-d7653d361fe2dcc420474ccf47737d4b.
  • pvc-a85b1d1f-3717-4f02-8754-c65fbcc2d981
    • Deleting.
    • Waiting for pvc-a85b1d1f-3717-4f02-8754-c65fbcc2d981-e-3a154d58 to be fully deleted. Its deletionTimestamp is set, but it still appears in the instance map of instance-manager-d7653d361fe2dcc420474ccf47737d4b, so we are likely waiting for it to disappear.

After speaking with @PhanLe1010 and doing some analysis, it appears the instance map of instance-manager-d7653d361fe2dcc420474ccf47737d4b is not being updated. This is adversely affecting different volumes in different ways, depending on the operation you are attempting and whether it is an engine or replica on the associated node.

We see the log below, so it seems longhorn-manager is trying to update the instance map, but instance-manager is not responding in time.

2024-02-12T09:00:29.946208656Z E0212 09:00:29.946089       1 instance_manager_controller.go:1528] failed to poll instance info to update instance manager instance-manager-d7653d361fe2dcc420474ccf47737d4b: failed to list instances: rpc error: code = DeadlineExceeded desc = context deadline exceeded

We may eventually need to delete its pod in the hope of getting it kickstarted, but that will cause any workloads that are still successfully running on it to crash. It's a bit of a shot in the dark, but if you are willing to capture some profiling data, it may give us clues.

kubectl port-forward -n longhorn-system instance-manager-d7653d361fe2dcc420474ccf47737d4b 6060:6060

# In another terminal:
curl --output goroutine.out http://localhost:6060/debug/pprof/goroutine?debug=1
curl --output profile.out http://localhost:6060/debug/pprof/profile?debug=1
curl --output mutex.out http://localhost:6060/debug/pprof/mutex?debug=1
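
As a quick sanity check before pulling the profiles, you can confirm the port-forward and pprof endpoint are reachable (a sketch; the /debug/pprof/ index page is served by Go's net/http/pprof handler):

# Should print 200 if the pprof handler is reachable through the port-forward
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:6060/debug/pprof/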

Eilyre commented Feb 12, 2024

@ejweber @PhanLe1010 Sent the *.out files via email.

PhanLe1010 commented Feb 13, 2024

Can we manually check the gRPC connection from the Longhorn manager pod longhorn-manager-mch66 to the instance-manager pod instance-manager-d7653d361fe2dcc420474ccf47737d4b? Exec in with kubectl exec -it -n longhorn-system longhorn-manager-mch66 -- bash and run:

# Install grpcurl
zypper install wget
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.8.0/grpcurl_1.8.0_linux_x86_64.tar.gz
tar -zxvf grpcurl_1.8.0_linux_x86_64.tar.gz
mv grpcurl /usr/local/bin/

# Call instance manager gRPC APIs
wget https://raw.githubusercontent.com/longhorn/longhorn-instance-manager/master/pkg/imrpc/imrpc.proto
wget https://raw.githubusercontent.com/grpc/grpc/master/src/proto/grpc/health/v1/health.proto

# Check the health of the gRPC server on instance-manager-d7653d361fe2dcc420474ccf47737d4b.
# Note: double-check the IP first. Is it still 10.0.0.246?
grpcurl -d '' -plaintext -import-path ./ -proto health.proto 10.0.0.246:8500 grpc.health.v1.Health/Check
# Provide us the output 

grpcurl -d '' -plaintext -import-path ./ -proto imrpc.proto 10.0.0.246:8500 ProcessManagerService/ProcessList
# Provide us the output 

To double-check the IP of instance-manager-d7653d361fe2dcc420474ccf47737d4b, could you run kubectl get pods -n longhorn-system -o wide?
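
Alternatively, a quick way to confirm just that pod's IP (a sketch using kubectl's jsonpath output):

kubectl get pod instance-manager-d7653d361fe2dcc420474ccf47737d4b -n longhorn-system -o jsonpath='{.status.podIP}{"\n"}'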

Eilyre commented Feb 13, 2024

Sorry for the delay @PhanLe1010, here's the information:

$ kubectl get pod/instance-manager-d7653d361fe2dcc420474ccf47737d4b -n longhorn-system -owide
NAME                                                READY   STATUS    RESTARTS   AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
instance-manager-d7653d361fe2dcc420474ccf47737d4b   1/1     Running   0          5d15h   10.0.0.246   skylark4.hpc.ut.ee   <none>           <none>

longhorn-manager-mch66:/ # grpcurl -d '' -plaintext -import-path ./ -proto health.proto 10.0.0.246:8500 grpc.health.v1.Health/Check
{
  "status": "SERVING"
}

longhorn-manager-mch66:/ # grpcurl -d '' -plaintext -import-path ./ -proto imrpc.proto 10.0.0.246:8500 ProcessManagerService/ProcessList
<stuck here for over 10 minutes, doesn't finish>

Looking at the process table of longhorn-manager-mch66:

longhorn-manager-mch66:/ # ps axu
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1 52.0  0.0 1740208 513164 ?      Ssl  Feb07 4260:58 longhorn-manager -d daemon --engine-image longhornio/longhorn-engine:v1.6.0 -
root     4150240  0.0  0.0   7092  4440 pts/0    Ss   11:42   0:00 /bin/bash
root     4150270  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpgconf] <defunct>
root     4150272  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpgconf] <defunct>
root     4150274  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150276  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpgsm] <defunct>
root     4150278  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpgconf] <defunct>
root     4150280  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150282  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150284  0.0  0.0      0     0 ?        Z    11:42   0:00 [gpg-agent] <defunct>
root     4150285  0.0  0.0      0     0 ?        Zs   11:42   0:00 [gpg-agent] <defunct>
root     4150288  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150290  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150301  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150303  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150305  0.0  0.0      0     0 ?        Z    11:42   0:00 [gpg-agent] <defunct>
root     4150306  0.0  0.0      0     0 ?        Zs   11:42   0:00 [gpg-agent] <defunct>
root     4150309  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150311  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150313  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150315  0.0  0.0      0     0 ?        Z    11:42   0:00 [gpg-agent] <defunct>
root     4150316  0.0  0.0      0     0 ?        Zs   11:42   0:00 [gpg-agent] <defunct>
root     4150319  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150321  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150323  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150325  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150327  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>
root     4150329  0.0  0.0      0     0 pts/0    Z    11:42   0:00 [gpg2] <defunct>

So instance-manager does not seem to be answering, but what is weird is that the instance-manager processes seem to be fine:

instance-manager-d7653d361fe2dcc420474ccf47737d4b:/ # ps axu | grep Z 
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     1234204  0.0  0.0   5288   812 pts/0    S+   12:20   0:00 grep Z

So maybe it's not instance-manager that is dead, but longhorn-manager itself? Is restarting longhorn-manager a safe operation? I presume there's no way around restarting the components/node, so I am planning a maintenance.

I also checked the logs and metrics, and I cannot really find any reason why these processes would be killed. No limits are being hit for Longhorn, the node seems otherwise fine, and Longhorn has the only zombie processes.

ejweber commented Feb 13, 2024

@Eilyre, I'll be looking into this again this morning.

longhorn-manager-mch66:/ # grpcurl -d '' -plaintext -import-path ./ -proto imrpc.proto 10.0.0.246:8500 ProcessManagerService/ProcessList
<stuck here for over 10 minutes, doesn't finish>

Something like this is what I was expecting. In general, we know the instance-manager process is running because it continues to log in the support bundle, but maybe one of its gRPC services is hung somehow. Similarly, longhorn-manager is also logging (quite a lot), so it appears to still be alive as well.

So maybe it's not instance-manager that is dead, but longhorn-manager itself? Is restarting longhorn-manager a safe operation? I presume there's no way around restarting the components/node, so I am planning a maintenance.

TBH, I don't have much confidence it will help, but restarting longhorn-manager itself is a safe operation. It should not affect running volumes negatively.
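
If you do want to try it, a minimal sketch (longhorn-manager runs as a DaemonSet, so the deleted pod is recreated automatically; the app=longhorn-manager label is the usual default and is an assumption here):

kubectl delete pod -n longhorn-system longhorn-manager-mch66
# Watch for the replacement pod to come back up on the same node
kubectl get pods -n longhorn-system -l app=longhorn-manager -o wide -w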

ejweber commented Feb 13, 2024

Apparently mutex profiling is disabled by default (I probably should have known that), so mutex.out doesn't have anything interesting. However, goroutine.out shows thousands of goroutines within instance-manager's process manager component waiting to acquire a lock. This fits with the other information we have: the instance-manager process is running and generally responsive at the network layer, but cannot successfully complete some operations.

goroutine profile: total 9160
6032 @ 0x43e44e 0x44f998 0x44f96f 0x46d5c5 0xa60625 0xa605ed 0xa5fbd3 0x924709 0x8f5283 0x8fa38c 0x8f31f9 0x4715c1
#	0x46d5c4	sync.runtime_SemacquireRWMutexR+0x24										/usr/local/go/src/runtime/sema.go:82
#	0xa60624	sync.(*RWMutex).RLock+0x64											/usr/local/go/src/sync/rwmutex.go:71
#	0xa605ec	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).findProcess+0x2c				/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:349
#	0xa5fbd2	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).ProcessDelete+0x92				/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:274
#	0x924708	github.com/longhorn/longhorn-instance-manager/pkg/imrpc._ProcessManagerService_ProcessDelete_Handler+0x168	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/imrpc/imrpc.pb.go:1302
#	0x8f5282	google.golang.org/grpc.(*Server).processUnaryRPC+0xe02								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1372
#	0x8fa38b	google.golang.org/grpc.(*Server).handleStream+0xfeb								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1783
#	0x8f31f8	google.golang.org/grpc.(*Server).serveStreams.func2.1+0x58							/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1016

1457 @ 0x43e44e 0x44f998 0x44f96f 0x46d5c5 0xa60625 0xa605ed 0xa60765 0x9249c9 0x8f5283 0x8fa38c 0x8f31f9 0x4715c1
#	0x46d5c4	sync.runtime_SemacquireRWMutexR+0x24									/usr/local/go/src/runtime/sema.go:82
#	0xa60624	sync.(*RWMutex).RLock+0x64										/usr/local/go/src/sync/rwmutex.go:71
#	0xa605ec	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).findProcess+0x2c			/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:349
#	0xa60764	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).ProcessGet+0x24			/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:358
#	0x9249c8	github.com/longhorn/longhorn-instance-manager/pkg/imrpc._ProcessManagerService_ProcessGet_Handler+0x168	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/imrpc/imrpc.pb.go:1320
#	0x8f5282	google.golang.org/grpc.(*Server).processUnaryRPC+0xe02							/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1372
#	0x8fa38b	google.golang.org/grpc.(*Server).handleStream+0xfeb							/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1783
#	0x8f31f8	google.golang.org/grpc.(*Server).serveStreams.func2.1+0x58						/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1016

1129 @ 0x43e44e 0x44f998 0x44f96f 0x46d5c5 0xa60885 0xa60861 0x924c89 0x8f5283 0x8fa38c 0x8f31f9 0x4715c1
#	0x46d5c4	sync.runtime_SemacquireRWMutexR+0x24										/usr/local/go/src/runtime/sema.go:82
#	0xa60884	sync.(*RWMutex).RLock+0x64											/usr/local/go/src/sync/rwmutex.go:71
#	0xa60860	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).ProcessList+0x40				/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:367
#	0x924c88	github.com/longhorn/longhorn-instance-manager/pkg/imrpc._ProcessManagerService_ProcessList_Handler+0x168	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/imrpc/imrpc.pb.go:1338
#	0x8f5282	google.golang.org/grpc.(*Server).processUnaryRPC+0xe02								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1372
#	0x8fa38b	google.golang.org/grpc.(*Server).handleStream+0xfeb								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1783
#	0x8f31f8	google.golang.org/grpc.(*Server).serveStreams.func2.1+0x58							/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1016

If you are up for it, please send the following:

  • A similar goroutine.out from your other running instance-managers for contrast. That can help us understand whether the issue is isolated to this one.
  • A slightly different goroutine.out from instance-manager-d7653d361fe2dcc420474ccf47737d4b captured with the following command. This one gives information on each goroutine individually, including how long it has been waiting, and on what.
kubectl port-forward -n longhorn-system instance-manager-d7653d361fe2dcc420474ccf47737d4b 6060:6060

# In another terminal:
curl --output goroutine.out http://localhost:6060/debug/pprof/goroutine?debug=2
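
If you want a quick look yourself before sending it, a rough sketch for counting how many goroutines are parked on that RWMutex, based on the symbol names visible in the profile above:

# Count goroutines blocked acquiring the process manager's RWMutex
grep -Fc 'sync.runtime_SemacquireRWMutexR' goroutine.out
# Count goroutines inside the process manager component
grep -Fc 'process.(*Manager)' goroutine.out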

Eilyre commented Feb 13, 2024

@ejweber Thank you for looking into this issue in such detail!

I have not done anything yet (so no manager restarts). I have six instance-manager pods running.

kubectl get pods -n longhorn-system -o wide | grep instance-manager
instance-manager-182143d1fb2d12331056eac5af09545e        1/1     Running   0               5d21h   10.0.6.54    skylark3   <none>           <none>
instance-manager-2e8a523d8175db7f11d2701a4ffefc33        1/1     Running   0               6d10h   10.0.6.167   skylark3   <none>           <none>
instance-manager-403308fa4b7d883e8d29b2051fef21c9        1/1     Running   0               6d8h    10.0.7.190   skylark1   <none>           <none>
instance-manager-633f2fd74d46ad74eed5fa51887d7ff5        1/1     Running   0               5d21h   10.0.3.195   skylark2   <none>           <none>
instance-manager-d7653d361fe2dcc420474ccf47737d4b        1/1     Running   0               5d21h   10.0.0.246   skylark4   <none>           <none>
instance-manager-eb41e75a5c50cf128f95b2a6fe8c277a        1/1     Running   0               5d21h   10.0.7.215   skylark1   <none>           <none>

Four of them were nicely able to port-forward and give me the goroutine information:

instance-manager-182143d1fb2d12331056eac5af09545e.out  
instance-manager-d7653d361fe2dcc420474ccf47737d4b.out
instance-manager-633f2fd74d46ad74eed5fa51887d7ff5.out  
instance-manager-eb41e75a5c50cf128f95b2a6fe8c277a.out

But two of them do not have anything listening on port 6060! I do not know if this is normal.

instance-manager-2e8a523d8175db7f11d2701a4ffefc33:/ # ss -t4l
State           Recv-Q          Send-Q                   Local Address:Port                           Peer Address:Port          Process
LISTEN          0               4096                           0.0.0.0:iscsi-target                        0.0.0.0:*

instance-manager-403308fa4b7d883e8d29b2051fef21c9:/ # ss -t4l
State           Recv-Q          Send-Q                   Local Address:Port                           Peer Address:Port          Process
LISTEN          0               4096                           0.0.0.0:iscsi-target                        0.0.0.0:*

So I was unable to run curl against them as well.

I sent the files via email, to the same thread. The special one (captured with debug=2) is suffixed with -different.

ejweber commented Feb 13, 2024

But two of them do not have anything listening on port 6060! I do not know if this is normal.

No worries there, I think. Those two are still on v1.5.3. We didn't have pprof support in that version.

ejweber commented Feb 13, 2024

Thanks to your help @Eilyre, we think we have identified a deadlock in the v1.6.0 version of instance-manager. I am going to create a new bug with just the relevant details and link it back to this one.

Unfortunately, we think the only resolution to your current situation is to restart the instance manager pod. This will be disruptive to workloads. If it is possible to do so, scaling down workloads running on the affected node before restarting the pod is recommended. While the associated volumes won't detach or move, this should ensure applications aren't writing to the volumes when their engines stop.
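
A minimal sketch of that sequence, assuming a Deployment-managed workload (substitute your own workload names and namespaces):

# Scale down workloads using volumes attached on the affected node
kubectl scale deployment <workload> -n <namespace> --replicas=0

# Delete the stuck instance-manager pod; Longhorn will recreate it
kubectl delete pod -n longhorn-system instance-manager-d7653d361fe2dcc420474ccf47737d4b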

Eilyre commented Feb 13, 2024

Thank you so much for the help @ejweber @PhanLe1010!

Draining the node and restarting the instance-manager (and longhorn-manager, just in case) pods brought the cluster out of the deadlock, and operations started happening again. I gave it a bit of time, and the dust has settled nicely; no issues.

Just for future reference, I understand this is probably a rare occurrence, but is there a way to detect this kind of deadlock?

PhanLe1010 commented Feb 13, 2024

Just for future reference, I understand this is probably a rare occurrence, but is there a way to detect this kind of deadlock?

I think the biggest signal is that your Longhorn volume fails to finish attaching/detaching, so your workload is stuck in a pending state.

Then we would have to look into the logs to figure it out. In this case, the hint is the log line failed to poll instance info to update instance manager...
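
For future monitoring, a rough sketch of a check that looks for that signature across the longhorn-manager pods (assuming the default app=longhorn-manager label; adjust the time window as needed):

kubectl logs -n longhorn-system -l app=longhorn-manager --since=1h --max-log-requests=10 | grep "failed to poll instance info to update instance manager"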

james-munson commented Feb 13, 2024

Hi, @Eilyre, would you mind collecting and posting a follow-up support bundle when you get a chance, so we can check that all the anomalous things we saw were artifacts of the stuck update and have now resolved themselves? Thanks!

Eilyre commented Feb 13, 2024

Hey @james-munson, sent via email like before.

innobead added the investigation-needed label Feb 14, 2024
Eilyre commented Feb 15, 2024

Hey @ejweber! Either my cluster is special again, or it's easier to hit this deadlock on an active cluster than you figured: I managed to hit it again, on a different node this time around.

E0215 13:58:11.961879       1 instance_manager_controller.go:1528] failed to poll instance info to update instance manager instance-manager-182143d1fb2d12331056eac5af09545e: failed to list instances: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Restarting the stuck instance-manager pod helped again.

ejweber commented Feb 15, 2024

I'm sorry to hear you hit it again so soon, @Eilyre. I would not have expected that.

Since I am not sure when we will have a followup release, I created a modified Docker image containing a fix (the webericjames/longhorn-instance-manager image shown in the listing below).

If you would like to run this modified instance-manager, you can:

  • Modify the --instance-manager-image setting with kubectl edit -n longhorn-system ds longhorn-manager, or
  • Helm upgrade with a modified image.longhorn.instanceManager variable.
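
For the Helm route, a minimal sketch, assuming the release is named longhorn and the chart splits that value into repository and tag (the image is the one from the listing below):

helm upgrade longhorn longhorn/longhorn -n longhorn-system --reuse-values \
  --set image.longhorn.instanceManager.repository=webericjames/longhorn-instance-manager \
  --set image.longhorn.instanceManager.tag=c49b76187e1c80dff7dfd22e0fc980a7200fb585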

If you choose to do so, you'll see multiple instance-managers running on each node until all of its engine and replica processes have stopped and started again. (You'll still be potentially vulnerable to this issue until all have switched over.)

eweber@laptop:~/longhorn> kl get instancemanager -oyaml | grep -e "name: instance" -e " image"
    name: instance-manager-31e487eb03dc04b4069e66277832d251
    image: longhornio/longhorn-instance-manager:v1.6.0
    name: instance-manager-615f4d04b2fbe572f1be9c5f885d4ce0
    image: webericjames/longhorn-instance-manager:c49b76187e1c80dff7dfd22e0fc980a7200fb585
    name: instance-manager-654318484492ae94dfe9f093e49da60e
    image: webericjames/longhorn-instance-manager:c49b76187e1c80dff7dfd22e0fc980a7200fb585
    name: instance-manager-ab27d6aa6e667225af60981bde317ad0
    image: webericjames/longhorn-instance-manager:c49b76187e1c80dff7dfd22e0fc980a7200fb585
    name: instance-manager-be8dc970931ca0a5de5564b16bd8ed62
    image: longhornio/longhorn-instance-manager:v1.6.0
    name: instance-manager-fc3f5342bd2b0053bdc23bf1d630d65c
    image: longhornio/longhorn-instance-manager:v1.6.0

If you pick up the modified version, please let us know how it goes. If not, that's completely understandable.

Eilyre commented Feb 16, 2024

Thank you for the modified version @ejweber! I talked with my team, and we figured we'd apply it when we hit the deadlock for the third time (since at that point we need to detach/attach the volumes anyway). Hopefully how often it happens will be useful information for you as well.


hef commented Feb 19, 2024

If you pick up the modified version, please let us know how it goes. If not, that's completely understandable.

I attempted to use these images, but I believe they are not built for multiarch. I have mixed architectures.

Eilyre commented Feb 19, 2024

Hey @ejweber! We hit the same issue again only 26 hours later (we didn't notice), and then a few days later with another instance manager at the same time. I have now upgraded to your version of instance manager and rebooted two of the stuck instance managers.

There were some initial issues with volumes whose primary replica was attached to a workload by an instance manager still running the old version. When these volumes tried to schedule new replicas on the nodes running the new instance manager, they got weird errors.

The errors:

failed to add replica tcp://10.0.7.235:10812 for volume: rpc error: code = Unknown desc = failed to create replica tcp://10.0.7.235:10812 for volume 10.0.3.169:10508: rpc error: code = Unknown desc = snapshot count usage 20 is equal or larger than snapshotMaxCount 20" 
'FailedDeleting' Failed to remove rebuilding replica pvc-42d55114-7a2e-4403-9584-adaf38e53087-r-fe9c2491 with address 10.0.3.169:10060 for engine pvc-42d55114-7a2e-4403-9584-adaf38e53087-e-0 and volume pvc-42d55114-7a2e-4403-9584-adaf38e53087 due to rebuilding failure: proxyServer=10.0.3.169:8501 destination=10.0.3.169:10242: failed to remove replica tcp://10.0.3.169:10060 for volume: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"

Changing the snapshot max count and downscaling the workload helped, but... can and should the snapshot max count block replica rebuilds?

I'll let you know if there are any issues with the custom version. Thank you!

ejweber commented Feb 21, 2024

@Eilyre, I checked the code and it is expected. We must take a snapshot at the start of the rebuild. This allows anything written before that moment to become immutable (so it can be transferred to the new replica safely). Any writes after that moment can be issued to all replicas (including the rebuilding one). At the same time, snapshot-max-count must be honored.

It is unusual to unexpectedly have 20 snapshots of a volume. Were these snapshots created intentionally (by a recurring job), or were they system snapshots? If they were system snapshots, is auto-cleanup-system-generated-snapshots enabled?

If this happens again, it should be possible to delete a snapshot for the volume in question (e.g. via the UI) instead of increasing the limit.
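
A rough sketch of doing that with kubectl instead of the UI (this assumes the snapshot CRs carry a longhornvolume label; the names are placeholders):

# List Longhorn snapshot CRs for a given volume
kubectl get snapshots.longhorn.io -n longhorn-system -l longhornvolume=<volume-name>

# Delete one snapshot to get back under snapshot-max-count
kubectl delete snapshots.longhorn.io <snapshot-name> -n longhorn-system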

Feel free to open a docs issue if you feel there is somewhere we can better document this behavior.

Eilyre commented Feb 21, 2024

@ejweber They were created by automatic jobs (daily snapshots) before this new version introduced the snapshot-max-count option. I'll enable the cleanup options, thank you.
