[BUG] CSI components CrashLoopBackOff, failed to connect to unix://csi/csi.sock after cluster restart #7116

Closed
yangchiu opened this issue Nov 16, 2023 · 16 comments
Labels: area/csi, area/stability, area/upstream, backport/1.5.4, kind/bug, priority/1, reproduce/rare, require/backport
Milestone: v1.6.0

@yangchiu (Member):

Describe the bug (🐛 if you encounter this issue)

After a cluster restart (rebooting all k8s nodes, including the control plane node), all CSI components got stuck in CrashLoopBackOff:

╰─$ kl get pods
NAME                                                     READY   STATUS             RESTARTS          AGE
longhorn-manager-pxlcx                                   1/1     Running            16 (10h ago)      15h
longhorn-manager-gr26g                                   1/1     Running            16 (10h ago)      15h
engine-image-ei-b907910b-m6gr4                           1/1     Running            16 (10h ago)      15h
instance-manager-bb7e4a2035b50ec3896800262c56800c        1/1     Running            0                 10h
engine-image-ei-b907910b-7vfnt                           1/1     Running            16 (10h ago)      15h
instance-manager-dca4e951d12ab6ab6f714f42f1a2416e        1/1     Running            0                 10h
csi-attacher-67f8c99bcd-9fqwj                            1/1     Running            20 (10h ago)      15h
csi-snapshotter-6f6bf9c757-t68d7                         1/1     Running            19 (10h ago)      15h
csi-resizer-545b7c64f5-7l2hv                             1/1     Running            19 (10h ago)      15h
csi-attacher-67f8c99bcd-2chq8                            1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-g2x77                         1/1     Running            20 (10h ago)      15h
longhorn-driver-deployer-7f94bb668f-s7zwg                1/1     Running            16 (10h ago)      15h
longhorn-csi-plugin-h5xqw                                3/3     Running            72 (10h ago)      15h
longhorn-ui-646f4bc8df-tbmtt                             1/1     Running            29 (10h ago)      15h
csi-resizer-545b7c64f5-gzm2z                             1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-2p6lc                         1/1     Running            19 (10h ago)      15h
engine-image-ei-b907910b-9ztgm                           1/1     Running            16 (10h ago)      15h
csi-snapshotter-6f6bf9c757-jqqp9                         1/1     Running            19 (10h ago)      15h
longhorn-csi-plugin-4sqkv                                3/3     Running            69 (10h ago)      15h
longhorn-manager-dfn6d                                   1/1     Running            16 (10h ago)      15h
instance-manager-0435193fed60526c87fd3a53fb03ba39        1/1     Running            0                 10h
share-manager-pvc-911b2b67-1e45-4f26-914f-23f2090af998   1/1     Running            0                 10h
share-manager-pvc-81c99bd1-b424-4be2-a075-80e8b17334ab   1/1     Running            0                 10h
longhorn-ui-646f4bc8df-c9tqj                             1/1     Running            35 (10h ago)      15h
csi-attacher-67f8c99bcd-jsj54                            0/1     CrashLoopBackOff   141 (2m49s ago)   15h
longhorn-csi-plugin-kxtqn                                0/3     CrashLoopBackOff   535 (2m18s ago)   15h
csi-snapshotter-6f6bf9c757-xwmmn                         0/1     CrashLoopBackOff   141 (2m14s ago)   15h
csi-provisioner-557c7f7c44-wd57k                         0/1     CrashLoopBackOff   141 (2m5s ago)    15h
csi-resizer-545b7c64f5-8jxk4                             0/1     CrashLoopBackOff   141 (61s ago)     15h

They are all unable to connect to unix://csi/csi.sock:

╰─$ kl logs csi-attacher-67f8c99bcd-jsj54
I1116 01:28:03.328201       1 main.go:97] Version: v4.4.0
W1116 01:28:13.329951       1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:23.329624       1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:33.330048       1 connection.go:183] Still connecting to unix:///csi/csi.sock
E1116 01:28:33.330124       1 main.go:136] context deadline exceeded

To Reproduce

Run negative test case Restart Cluster While Workload Heavy Writing repeatedly.

Expected behavior

Support bundle for troubleshooting

supportbundle_ae1d1892-8da7-4733-97d7-16326976bb0e_2023-11-16T03-17-09Z.zip

worker nodes logs:
worker_nodes_log.txt

Environment

  • Longhorn version: v1.5.x-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

https://suse.slack.com/archives/C02DR3N5T24/p1700095454722879?thread_ts=1699951700.806219&cid=C02DR3N5T24

yangchiu added the kind/bug, reproduce/rare, and require/backport labels on Nov 16, 2023
yangchiu added this to the v1.6.0 milestone on Nov 16, 2023
@ejweber (Contributor) commented Dec 4, 2023

I cannot find any clue in the support bundle. Since the longhorn-csi-plugin is being repeatedly restarted, even if it fails to listen on csi.sock one time, it should succeed the next time, and the system should stabilize.

My immediate guess was SELinux. However, SELinux should have been set to permissive given the Jenkins parameters (I am not able to confirm or deny this from any of the logs we have). Additionally, it should have affected all nodes similarly.

I kicked off https://ci.longhorn.io/job/private/job/longhorn-e2e-test/129/ to try to reproduce it for live debugging. If that doesn't work, we will likely need to wait for another occurrence.

innobead added the priority/1 label on Dec 6, 2023
@PhanLe1010 (Contributor):

I hit this one when restarting a one-node cluster. I was planning to create a ticket, but it looks like one already exists here.
The issue in my case is that longhorn-csi-plugin stays in a crash loop for about 4-5 minutes and then recovers. Not sure if this is just a slowness issue.

@ejweber (Contributor) commented Dec 11, 2023

I kicked off https://ci.longhorn.io/job/private/job/longhorn-e2e-test/129/ to try to reproduce it for live debugging. If that doesn't work, we will likely need to wait for another occurrence.

This didn't help to reproduce the issue.

I hit this one when restarting a one-node cluster. I was planning to create a ticket, but it looks like one already exists here.
The issue in my case is that longhorn-csi-plugin stays in a crash loop for about 4-5 minutes and then recovers. Not sure if this is just a slowness issue.

I'll look into this a bit when I have time to see if there is an improvement we can make. It seems different from the original event, since, in that case, the CSI components never recovered.

@yaleman commented Dec 16, 2023

I'm having this same issue. What kind of troubleshooting can we actually perform to identify the cause? This is on a five-node cluster that was working fine until a power issue took it out; now one node is looping on this error:

Still connecting to unix:///csi/csi.sock

@ejweber (Contributor) commented Dec 18, 2023

Hello @yaleman. Could you provide a support bundle so we can confirm the symptoms are identical and look for additional clues?

Please upload it (or paste a link) here, or send it to longhorn-support-bundle@suse.com.

Also, per the analysis below:

  • What version of longhornio/livenessprobe is the longhorn-csi-plugin pod running in your cluster?
  • After you have captured a support bundle, does terminating the offending longhorn-csi-plugin pod resolve the issue? This worked in my reproduction.

@ejweber (Contributor) commented Dec 18, 2023

  • A 30-second timeout was introduced to github.com/kubernetes-csi/csi-lib-utils/connection.Connect in this commit. The commit is included in github.com/kubernetes-csi/csi-lib-utils >= v0.14.0.
  • The latest version (v2.11.0) of github.com/kubernetes-csi/livenessprobe picked up that change in this commit.
  • In the support bundle provided by @longhorn/qa, we were running longhornio/livenessprobe:v2.11.0. I'm not sure yet why this was the case, since we deploy longhornio/livenessprobe:v2.9.0 by default still. Maybe we were actively testing an improvement?

Here are the latest longhorn-csi-plugin logs from the support bundle:

2023-11-16T03:13:13.665926468Z time="2023-11-16T03:13:13Z" level=info msg="CSI Driver: driver.longhorn.io version: v1.6.0-dev, manager URL http://longhorn-backend:9500/v1" func="csi.(*Manager).Run" file="manager.go:23"
2023-11-16T03:13:13.687607928Z time="2023-11-16T03:13:13Z" level=info msg="Enabling node service capability: GET_VOLUME_STATS" func=csi.getNodeServiceCapabilities file="node_server.go:756"
2023-11-16T03:13:13.687626652Z time="2023-11-16T03:13:13Z" level=info msg="Enabling node service capability: STAGE_UNSTAGE_VOLUME" func=csi.getNodeServiceCapabilities file="node_server.go:756"
2023-11-16T03:13:13.687648727Z time="2023-11-16T03:13:13Z" level=info msg="Enabling node service capability: EXPAND_VOLUME" func=csi.getNodeServiceCapabilities file="node_server.go:756"
2023-11-16T03:13:13.687698615Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: CREATE_DELETE_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687704255Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687707962Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: EXPAND_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687711070Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: CREATE_DELETE_SNAPSHOT" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687713239Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: CLONE_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687792201Z time="2023-11-16T03:13:13Z" level=info msg="Enabling volume access mode: SINGLE_NODE_WRITER" func=csi.getVolumeCapabilityAccessModes file="controller_server.go:1280"
2023-11-16T03:13:13.687797371Z time="2023-11-16T03:13:13Z" level=info msg="Enabling volume access mode: MULTI_NODE_MULTI_WRITER" func=csi.getVolumeCapabilityAccessModes file="controller_server.go:1280"
2023-11-16T03:13:13.688131494Z time="2023-11-16T03:13:13Z" level=info msg="Listening for connections on address: &net.UnixAddr{Name:\"//csi/csi.sock\", Net:\"unix\"}" func="csi.(*NonBlockingGRPCServer).serve" file="server.go:100"

There is no evidence of anything wrong with longhorn-csi-plugin in this latest run.

Here are the latest livenessprobe logs from the support bundle:

2023-11-16T03:15:31.406665514Z W1116 03:15:31.406497   17969 connection.go:183] Still connecting to unix:///csi/csi.sock
2023-11-16T03:15:41.406718665Z W1116 03:15:41.406526   17969 connection.go:183] Still connecting to unix:///csi/csi.sock
2023-11-16T03:15:51.406020666Z W1116 03:15:51.405855   17969 connection.go:183] Still connecting to unix:///csi/csi.sock
2023-11-16T03:15:51.406094710Z F1116 03:15:51.405889   17969 main.go:146] failed to establish connection to CSI driver: context deadline exceeded

Thirty seconds after livenessprobe starts, it fails. From the code, livenessprobe is expected to loop here indefinitely instead of failing; due to the changes mentioned above, that no longer happens.
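
As a minimal sketch of that behavioral difference (an illustration only, not the actual csi-lib-utils or livenessprobe code): a blocking gRPC dial against the CSI socket behaves very differently depending on whether the surrounding context carries a deadline. The socket path and the 30-second value mirror the logs above; everything else is assumed for illustration.

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // Old behavior (sketched): with no deadline, the blocking dial simply waits
    // indefinitely for csi.sock to appear; the real library logs
    // "Still connecting ..." while it waits.
    // ctx := context.Background()

    // New behavior: the whole connection attempt runs under a 30-second deadline.
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx,
        "unix:///csi/csi.sock",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithBlock(), // block until connected or the context expires
    )
    if err != nil {
        // With the deadline, this is reached roughly 30 seconds after startup if
        // the driver has not created the socket yet, and the process exits
        // instead of waiting, which is what the livenessprobe logs above show.
        log.Fatalf("failed to establish connection to CSI driver: %v", err)
    }
    defer conn.Close()
    log.Println("connected to CSI driver")
}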

The following sequence of events seems possible (though I have not managed to reproduce it):

  • longhorn-csi-plugin fails early and often. This is speculation, but maybe the longhorn-backend service is not up yet, so it cannot connect as it initializes.
  • Instead of waiting forever to connect to longhorn-csi-plugin via csi.sock, livenessprobe fails every 30 seconds.
  • Both longhorn-csi-plugin and livenessprobe fall into exponential restart backoff. As the backoff grows, it becomes possible for livenessprobe to start, fail to connect for 30 seconds, and die while longhorn-csi-plugin is waiting to start.
  • Eventually, longhorn-csi-plugin is able to consistently start successfully (as seen in the logs above). However, it always manages to do so at a time when livenessprobe is down.
  • Without livenessprobe, the liveness probe configured on longhorn-csi-plugin is guaranteed to fail. Kubernetes kills it.
  • Without longhorn-csi-plugin, livenessprobe cannot start. It fails (as seen in the logs above), and Kubernetes restarts it.

Additional evidence:

  • From the support bundle, longhorn-csi-plugin is in 5m0s exponential backoff. It last finished at 2023-11-16T03:15:51Z after having been up for 15s. Its exit code is 143, which is associated with graceful termination at the request of Kubernetes.
  • From the support bundle, livenessprobe is in 5m0s exponential backoff. It last finished at 2023-11-16T03:15:51Z.
  • If we continue this pattern, longhorn-csi-plugin and livenessprobe may eventually run successfully at the same time, but clearly many of the next iterations will result in failure.

@ejweber (Contributor) commented Dec 18, 2023

  • In the support bundle provided by @longhorn/qa, we were running longhornio/livenessprobe:v2.11.0. I'm not sure yet why this was the case, since we deploy longhornio/livenessprobe:v2.9.0 by default still. Maybe we were actively testing an improvement?

One mystery solved. #6920 bumped the version of livenessprobe (and other CSI components) in the deployment manifests. This change hasn't made it into any released versions yet, though. It is expected in v1.5.x and master test builds, but not out in the wild.

@ejweber (Contributor) commented Dec 18, 2023

Reproduce steps:

  • kubectl edit -n longhorn-system service longhorn-backend and change the port from 9500 to 9501. This effectively breaks the longhorn-backend service temporarily.
  • kubectl delete -n longhorn-system pod longhorn-csi-plugin-<xxxxxx>.
  • Wait some length of time. All three containers go into CrashLoopBackOff.
  • kubectl edit -n longhorn-system service longhorn-backend and restore the correct port.
  • Monitor longhorn-csi-plugin. It can never recover (or at least not for a long time) and exhibits the behavior described above.

In the example below, I kept the longhorn-backend service broken for 6m30s. The situation was not improved 7m30s after that.

longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   16 (1s ago)     6m36s
longhorn-csi-plugin-6zxtn                           2/3     CrashLoopBackOff   18 (2m11s ago)   8m46s
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   18 (2m41s ago)   9m16s
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   18 (1s ago)      9m17s
longhorn-csi-plugin-6zxtn                           1/3     CrashLoopBackOff   19 (2m27s ago)   11m
longhorn-csi-plugin-6zxtn                           1/3     CrashLoopBackOff   20 (2s ago)      11m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   20 (1s ago)      12m
longhorn-csi-plugin-6zxtn                           2/3     CrashLoopBackOff   22 (2m16s ago)   14m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   22 (2m46s ago)   14m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   22 (2s ago)      14m

Deleting longhorn-csi-plugin-6zxtn immediately resolved the issue.

@ejweber (Contributor) commented Dec 18, 2023

I replaced livenessprobe v2.11.0 with v2.10.0. While the other two containers went into a crash loop (as expected) when the longhorn-backend service was broken, the livenessprobe container did not.

W1218 22:53:48.550210  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:53:58.549453  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:08.549543  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:18.550311  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:28.550369  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:38.550120  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:48.549873  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:58.549510  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:08.549488  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:18.550327  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:28.549449  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:38.552287  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:48.549367  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:58.549464  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:08.549487  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:18.549671  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:28.549355  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:38.549645  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:48.549588  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:58.549452  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:57:08.549906  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:57:18.549325  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:57:28.549934  114434 connection.go:173] Still connecting to unix:///csi/csi.sock

Since the containers were deep into backoff when I restored the service, it still took a while to recover (I had to wait for the longhorn-csi-plugin container to be allowed to restart). However, full recovery eventually occurred.

longhorn-csi-plugin-dgrgh                           2/3     CrashLoopBackOff   10 (92s ago)    5m36s
longhorn-csi-plugin-dgrgh                           1/3     CrashLoopBackOff   10 (2m2s ago)   6m6s
longhorn-csi-plugin-dgrgh                           1/3     CrashLoopBackOff   10 (15s ago)    6m20s
longhorn-csi-plugin-dgrgh                           2/3     CrashLoopBackOff   11 (30s ago)    6m35s
longhorn-csi-plugin-dgrgh                           3/3     Running            12 (2m53s ago)   8m58s

@ejweber (Contributor) commented Dec 18, 2023

Ideas to fix (in order of preference?):

  1. Patch upstream livenessprobe to restore previous behavior. Revert to v2.10.0 (or v2.9.0, since we already have a Longhorn version of it) in the meantime.
  2. Add a startup probe to the longhorn-csi-plugin container with a 5m window or configure its livenessProbe.initialDelaySeconds to 5m. Kubernetes will not kill this container for the first five minutes it is alive. This will keep it alive long enough for livenessprobe to connect with it whenever it comes back, but maintain the existing probe behavior after startup.
  3. Rework longhorn-csi-plugin startup so that it does not crash if it cannot connect to the longhorn-backend service (see the sketch after this list). This may prevent the offset CrashLoopBackOffs from occurring in the first place. Kubernetes liveness probes will still fail, since the container is not ready. This only works as a potential addendum to 2.
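
A rough sketch of idea 3, under the assumption that the plugin could simply poll the manager API before serving on csi.sock (the function name waitForBackend, the URL, and the retry budget here are illustrative assumptions, not the actual longhorn-csi-plugin code):

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// waitForBackend polls the manager URL until it answers or the retry budget is
// exhausted. Staying alive while retrying avoids the offset CrashLoopBackOffs
// between longhorn-csi-plugin and livenessprobe described above.
func waitForBackend(url string, interval time.Duration, attempts int) error {
    client := &http.Client{Timeout: 5 * time.Second}
    for i := 0; i < attempts; i++ {
        resp, err := client.Get(url)
        if err == nil {
            resp.Body.Close()
            return nil
        }
        log.Printf("longhorn-backend not reachable yet (%v), retrying in %s", err, interval)
        time.Sleep(interval)
    }
    return fmt.Errorf("longhorn-backend still unreachable after %d attempts", attempts)
}

func main() {
    // The manager URL matches the one in the longhorn-csi-plugin logs above.
    if err := waitForBackend("http://longhorn-backend:9500/v1", 5*time.Second, 60); err != nil {
        log.Fatal(err)
    }
    log.Println("backend reachable, starting the CSI gRPC server on csi.sock")
    // ... the real plugin would start its NonBlockingGRPCServer here ...
}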

@yaleman commented Dec 18, 2023

Mine seems to have been sorted by restarting the node; it's transient weirdness, unfortunately. I'll send a support bundle once I get one.

@ejweber (Contributor) commented Dec 19, 2023

Mine seems to have been sorted by restarting the node; it's transient weirdness, unfortunately. I'll send a support bundle once I get one.

Yes, I think that fits with the analysis I've done. Even without the bundle, can you confirm which version of the livenessprobe image you are running in your cluster? For example:

eweber@laptop:~> kl get pod -l app=longhorn-csi-plugin -oyaml | grep livenessprobe
      image: longhornio/livenessprobe:v2.9.0
      image: docker.io/longhornio/livenessprobe:v2.9.0
      imageID: docker.io/longhornio/livenessprobe@sha256:555e27888f3692ab4130616ce4de62f950bdac16da3af4c432e13a71604e0261
      image: longhornio/livenessprobe:v2.9.0
      image: docker.io/longhornio/livenessprobe:v2.9.0
      imageID: docker.io/longhornio/livenessprobe@sha256:555e27888f3692ab4130616ce4de62f950bdac16da3af4c432e13a71604e0261
      image: longhornio/livenessprobe:v2.9.0
      image: docker.io/longhornio/livenessprobe:v2.9.0
      imageID: docker.io/longhornio/livenessprobe@sha256:555e27888f3692ab4130616ce4de62f950bdac16da3af4c432e13a71604e0261

@ejweber (Contributor) commented Dec 22, 2023

I reviewed #6916 and it looks like we really do need to go to livenessprobe v2.11.0 if we want to mitigate the CVEs it mentions.

The upstream bug has some traction, but it's unlikely we'll see a version we can grab and test before v1.6.0 releases. I'll move forward with my mitigation PR.

  • Create an issue to revert the mitigation when we upgrade livenessprobe if the mitigation changes are approved and merged.

@longhorn-io-github-bot commented Dec 22, 2023

Pre Ready-For-Testing Checklist

@yangchiu (Member, Author):

I reviewed #6916 and it looks like we really do need to go to livenessprobe v2.11.0 if we want to mitigate the CVEs it mentions.

The upstream bug has some traction, but it's unlikely we'll see a version we can grab and test before v1.6.0 releases. I'll move forward with my mitigation PR.

* [x]  Create an issue to revert the mitigation when we upgrade livenessprobe if the mitigation changes are approved and merged.

ref: #7428

@yangchiu (Member, Author):

Verified as passed on master-head (longhorn-manager 859d438) following the test steps in #7116 (comment) and longhorn/longhorn-manager#2388 (comment). The longhorn-csi-plugin pods run stably after the backend service is restored.

Also ran the negative test case Restart Cluster While Workload Heavy Writing 60 times and didn't observe this issue. Test results:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/133/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/134/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/135/
