New csi-lib-utils/connection.Connect logic can cause permanent CSI plugin outage #236

ejweber · 2023-12-19T18:21:58Z

The livenessprobe code expects to try forever to connect with the CSI plugin via csi.sock on startup.

Lines 142 to 147 in 33ea6c0

    
    csiConn, err := acquireConnection(context.Background(), metricsManager)
    if err != nil {
        // connlib should retry forever so a returned error should mean
        // the grpc client is misconfigured rather than an error on the network
        klog.Fatalf("failed to establish connection to CSI driver: %v", err)
    }

However, this commit recently picked up a change in csi-lib-utils that returns an error after only 30 seconds.

According to the associated PR, the goal was to avoid a deadlock in which node-driver-registrar failed permanently to connect to a CSI plugin because it was referencing an old file descriptor.

In this analysis, I described a situation in which this new behavior caused a permanent outage of the Longhorn CSI plugin. Details are there, but essentially:

1. The CSI plugin fails to start for an ephemeral reason and enters a CrashLoopBackOff.
2. livenessprobe fails to connect and enters a CrashLoopBackOff.
3. Eventually, the CSI plugin can start successfully. Since livenessprobe is not running at that time, kubelet kills it, increasing the backoff.
4. Every time livenessprobe starts, the CSI plugin is waiting in backoff, so livenessprobe crashes, increasing the backoff.

IMO, livenessprobe's previous behavior was correct. It should not crash unless it is misconfigured so it is always available to answer kubelet's liveness probes.

Assuming the csi-lib-utils change was necessary, my thinking is that we should recognize the timeout error in livenessprobe and ignore it during initialization. However, I'm not sure I understand the exact cause of kubernetes-csi/csi-lib-utils#131. Maybe this could similarly lead to a liveness probe stuck permanently in initialization?
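To make the "recognize the timeout and ignore it" idea concrete, here is a minimal sketch using only the Go standard library. It is not the livenessprobe or csi-lib-utils code: `connectFunc`, `connectIgnoringTimeouts`, and `demoRetry` are hypothetical names, and the fake attempt simply blocks until its per-attempt deadline to imitate a driver that is not listening yet.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// connectFunc stands in for one bounded connection attempt.
// Hypothetical type; it is not part of csi-lib-utils.
type connectFunc func(ctx context.Context) error

// connectIgnoringTimeouts keeps calling attempt until it succeeds.
// A per-attempt timeout is treated as "driver not ready yet" and retried;
// any other error (or cancellation of the parent ctx) is returned.
func connectIgnoringTimeouts(ctx context.Context, attempt connectFunc, perAttempt time.Duration) (int, error) {
	for tries := 1; ; tries++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err := attempt(attemptCtx)
		cancel()
		switch {
		case err == nil:
			return tries, nil
		case errors.Is(err, context.DeadlineExceeded) && ctx.Err() == nil:
			continue // only the per-attempt timeout fired; try again
		default:
			return tries, err
		}
	}
}

// demoRetry runs the loop against a fake driver that times out twice
// and then accepts the connection.
func demoRetry() (int, error) {
	calls := 0
	fake := func(ctx context.Context) error {
		calls++
		if calls < 3 {
			<-ctx.Done() // simulate a socket nobody is listening on yet
			return ctx.Err()
		}
		return nil
	}
	return connectIgnoringTimeouts(context.Background(), fake, 10*time.Millisecond)
}

func main() {
	tries, err := demoRetry()
	fmt.Println("attempts:", tries, "err:", err)
}
```

The point of the sketch is only that a timeout during initialization does not have to be fatal; whether retrying can itself get stuck is the open question raised above.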

cc @ConnorJC3 from the csi-lib-utils PR for any thoughts.


ConnorJC3 · 2023-12-20T16:36:12Z

Although that change sets a reasonable default for every other sidecar, the livenessprobe is special and as you say probably needs to either actually attempt to connect infinitely or ignore this error.

> Maybe this could similarly lead to a liveness probe stuck permanently in initialization?

It could, but I don't think that's as big of a deal. In the node-driver-registrar case, this bug leads to the driver never getting registered on the node - effectively permanently bricking the driver until manual intervention. In the livenessprobe case, the pod will eventually fail the check and be restarted, so the worst case scenario is a temporary delay of driver startup in a very rare race condition.

jsafrane · 2023-12-21T10:57:16Z

To me it seems the liveness probe container should never crash when it can't talk to the container, it should just report failed probes. Would it be possible to achieve this with the current connection lib? It would make sense even to use grpc directly, if our lib is not flexible enough.
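As a rough illustration of "report failed probes instead of crashing", the sketch below wires a health check into an HTTP handler that answers 500 when the driver is unreachable and 200 otherwise, without ever exiting the process. It is an assumption-laden stand-in, not the actual livenessprobe handler: `probeFunc`, `healthzHandler`, `statusFor`, and `demoStatuses` are hypothetical names, and the probe functions are fakes rather than real CSI calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeFunc stands in for one health check against the CSI driver.
// Hypothetical type; a real sidecar would issue a gRPC call here.
type probeFunc func(ctx context.Context) error

// healthzHandler runs the probe on every request and maps the result to an
// HTTP status. A failing probe produces a 500 response, never a process exit.
func healthzHandler(probe probeFunc, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		if err := probe(ctx); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "ok")
	}
}

// statusFor invokes the handler once in memory and returns the status code.
func statusFor(probe probeFunc) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	healthzHandler(probe, time.Second)(rec, req)
	return rec.Code
}

// demoStatuses returns the status for an unreachable and a healthy fake driver.
func demoStatuses() (down, up int) {
	unreachable := func(context.Context) error { return errors.New("driver socket not reachable") }
	healthy := func(context.Context) error { return nil }
	return statusFor(unreachable), statusFor(healthy)
}

func main() {
	down, up := demoStatuses()
	fmt.Println("driver down:", down, "driver up:", up)
}
```

With this shape kubelet always gets an answer from the probe container, and the decision to restart anything stays with the configured probe thresholds.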

cc @msau42

mauriciopoppe · 2023-12-21T16:01:05Z

What about the caller sending the option grpc.WithTimeout(time.Second * 30)? livenessprobe wouldn't send this option; all the other sidecars would send it. Looks like the timeout is also set in ConnectWithoutMetrics.

jsafrane · 2024-01-04T13:50:10Z

There is connlib.Connect(..., WithTimeout(0)) that disables the default 30 second timeout. I am testing it in #240.
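For readers unfamiliar with the pattern, the following is a self-contained sketch of how a functional option with a "zero means no timeout" convention can behave. It mirrors the option name mentioned above but is not the csi-lib-utils implementation: `connect`, `options`, the retry interval, and `demoTimeouts` are all hypothetical, and the 30 second default is taken from this thread.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// options and Option form a hypothetical functional-option setup.
type options struct{ timeout time.Duration }

type Option func(*options)

// WithTimeout sets the overall connect timeout; 0 is assumed to mean "retry forever".
func WithTimeout(d time.Duration) Option {
	return func(o *options) { o.timeout = d }
}

// connect retries dial every retryEvery until it succeeds or the timeout expires.
// It returns the number of attempts made.
func connect(ctx context.Context, dial func() error, retryEvery time.Duration, opts ...Option) (int, error) {
	o := options{timeout: 30 * time.Second} // default described in this thread
	for _, opt := range opts {
		opt(&o)
	}
	if o.timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, o.timeout)
		defer cancel()
	}
	for tries := 1; ; tries++ {
		if err := dial(); err == nil {
			return tries, nil
		}
		select {
		case <-ctx.Done():
			return tries, ctx.Err()
		case <-time.After(retryEvery):
		}
	}
}

// demoTimeouts dials a fake driver that becomes ready on the 10th attempt,
// once with a short timeout and once with the timeout disabled.
func demoTimeouts() (timedOut bool, tries int) {
	newDial := func() func() error {
		calls := 0
		return func() error {
			calls++
			if calls < 10 {
				return errors.New("socket not ready")
			}
			return nil
		}
	}
	_, err := connect(context.Background(), newDial(), 5*time.Millisecond, WithTimeout(10*time.Millisecond))
	timedOut = errors.Is(err, context.DeadlineExceeded)
	tries, _ = connect(context.Background(), newDial(), 5*time.Millisecond, WithTimeout(0))
	return timedOut, tries
}

func main() {
	timedOut, tries := demoTimeouts()
	fmt.Println("short timeout gave up:", timedOut, "attempts with timeout disabled:", tries)
}
```

In this sketch the short timeout gives up long before the fake driver is ready, while the disabled timeout keeps retrying until the tenth attempt succeeds, which is the behavior livenessprobe wants during startup.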

ejweber · 2024-01-04T16:19:57Z

I missed the possibility of setting it to 0 in #237, so I set it to 5 minutes. Your approach makes more logical sense.

jsafrane · 2024-01-04T16:35:54Z

@ejweber hey, sorry, I missed your PR... it was almost right!

Longhorn 7428 This reverts commit fe405e6. We no longer need these changes because livenessprobe was fixed upstream. kubernetes-csi/livenessprobe#236 Signed-off-by: Eric Weber <eric.weber@suse.com>

Longhorn 7428 This reverts commit fe405e6. We no longer need these changes because livenessprobe was fixed upstream. kubernetes-csi/livenessprobe#236 Signed-off-by: Eric Weber <eric.weber@suse.com> (cherry picked from commit 68ed92b)

upgrade livenessprobe container to v2.12.0 ref: kubernetes-csi/livenessprobe#236

ejweber mentioned this issue Dec 19, 2023

Prevent a crash loop caused by upstream changes #237

Closed

jsafrane mentioned this issue Jan 4, 2024

Don't exit the probe on connection issues #240

Merged

k8s-ci-robot closed this as completed in #240 Jan 5, 2024

This was referenced Feb 9, 2024

Revert "Add a startup probe to the longhorn-csi-plugin container" longhorn/longhorn-manager#2578

Merged

Use livenessprobe version with upstream fix longhorn/longhorn#7908

Merged

bboerst mentioned this issue Feb 16, 2024

Upgrade to latest liveness-probe v2.12.0 for inclusion of bug fix kubernetes-sigs/aws-ebs-csi-driver#1935

Closed

philnielsen mentioned this issue Apr 2, 2024

Stuck in Still connecting to unix:///csi/csi.sock kubernetes-sigs/aws-efs-csi-driver#1301

Closed

zxh326 added a commit to juicedata/charts that referenced this issue May 8, 2024

Update values.yaml

245077a

upgrade livenessprobe container to v2.12.0 ref: kubernetes-csi/livenessprobe#236

zxh326 mentioned this issue May 8, 2024

Update values.yaml juicedata/charts#100

Merged

zxh326 added a commit to juicedata/charts that referenced this issue May 8, 2024

Update values.yaml (#100)

c83f18f

upgrade livenessprobe container to v2.12.0 ref: kubernetes-csi/livenessprobe#236
