New csi-lib-utils/connection.Connect logic can cause permanent CSI plugin outage #236
Comments
Although that change sets a reasonable default for every other sidecar, the livenessprobe is special and, as you say, probably needs to either keep trying to connect indefinitely or ignore this error.
It could, but I don't think that's as big of a deal. In the node-driver-registrar case, this bug leads to the driver never getting registered on the node - effectively permanently bricking the driver until manual intervention. In the livenessprobe case, the pod will eventually fail the check and be restarted, so the worst-case scenario is a temporary delay of driver startup in a very rare race condition.
To me it seems the liveness probe container should never crash when it can't talk to the container; it should just report failed probes. Would that be possible to achieve with the current connection lib? It would even make sense to use grpc directly if our lib is not flexible enough. cc @msau42
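A minimal sketch of the "report failed probes instead of crashing" idea, using only the Go standard library. This is not the livenessprobe implementation: the names probeStatus, healthzHandler, and defaultProbeTimeout and the socket path are hypothetical, and a plain Unix socket dial stands in for the CSI Probe RPC a real sidecar would issue.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// defaultProbeTimeout is a hypothetical per-probe budget, not a livenessprobe default.
const defaultProbeTimeout = time.Second

// probeStatus dials the socket once and maps the outcome to an HTTP status code.
// A plain dial stands in for the CSI Probe RPC that a real sidecar would issue.
func probeStatus(socketPath string, timeout time.Duration) int {
	conn, err := net.DialTimeout("unix", socketPath, timeout)
	if err != nil {
		return http.StatusServiceUnavailable
	}
	conn.Close()
	return http.StatusOK
}

// healthzHandler reports a failed probe to the caller instead of terminating the process.
func healthzHandler(socketPath string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		status := probeStatus(socketPath, defaultProbeTimeout)
		w.WriteHeader(status)
		fmt.Fprintln(w, http.StatusText(status))
	}
}

func main() {
	// Hypothetical socket path; nothing listens there, so this prints 503.
	fmt.Println(probeStatus("/tmp/does-not-exist/csi.sock", defaultProbeTimeout))
}
```

Because the connection is attempted per probe, the process stays up to answer kubelet even while the plugin socket is missing, and kubelet's own failure threshold decides when to restart the pod.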
What about the caller sending the option
There is
I missed the possibility of setting it to 0 in #237, so I set it to 5 minutes. Your approach makes more logical sense.
@ejweber hey, sorry, I missed your PR... it was almost right!
Longhorn 7428 This reverts commit fe405e6. We no longer need these changes because livenessprobe was fixed upstream. kubernetes-csi/livenessprobe#236 Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 7428 This reverts commit fe405e6. We no longer need these changes because livenessprobe was fixed upstream. kubernetes-csi/livenessprobe#236 Signed-off-by: Eric Weber <eric.weber@suse.com> (cherry picked from commit 68ed92b)
upgrade livenessprobe container to v2.12.0 ref: kubernetes-csi/livenessprobe#236
The livenessprobe code expects to try forever to connect with the CSI plugin via csi.sock on startup.
livenessprobe/cmd/livenessprobe/main.go, lines 142 to 147 in 33ea6c0
However, this commit recently picked up a change in csi-lib-utils that returns an error after only 30 seconds.
According to the associated PR, the goal was to avoid a deadlock in which node-driver-registrar failed permanently to connect to a CSI plugin because it was referencing an old file descriptor.
In this analysis, I described a situation in which this new behavior caused a permanent outage of the Longhorn CSI plugin. Details are there, but essentially:
IMO, livenessprobe's previous behavior was correct. It should not crash unless it is misconfigured so it is always available to answer kubelet's liveness probes.
Assuming the csi-lib-utils change was necessary, my thinking is that we should recognize the timeout error in livenessprobe and ignore it during initialization. However, I'm not sure I understand the exact cause of kubernetes-csi/csi-lib-utils#131. Maybe this could similarly lead to a liveness probe stuck permanently in initialization?
cc @ConnorJC3 from the csi-lib-utils PR for any thoughts.
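A rough sketch of "recognize the timeout error and ignore it during initialization", written against the Go standard library only. It assumes the timeout surfaces as context.DeadlineExceeded, which may not match what csi-lib-utils actually returns; connectFunc, connectIgnoringTimeouts, and simulate are hypothetical names, and the fake connector stands in for a real connection attempt.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// connectFunc is a hypothetical stand-in for one bounded connection attempt.
type connectFunc func(ctx context.Context) error

// connectIgnoringTimeouts retries while attempts time out and returns the
// number of attempts made. Any other error is returned to the caller.
func connectIgnoringTimeouts(ctx context.Context, connect connectFunc, perAttempt time.Duration) (int, error) {
	for attempts := 1; ; attempts++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err := connect(attemptCtx)
		cancel()
		switch {
		case err == nil:
			return attempts, nil
		case errors.Is(err, context.DeadlineExceeded) && ctx.Err() == nil:
			continue // timed out, parent still alive: keep waiting for the plugin
		default:
			return attempts, err
		}
	}
}

// simulate fakes a plugin whose socket appears after `failures` timed-out attempts.
func simulate(failures int) (int, error) {
	calls := 0
	fake := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			<-ctx.Done() // block until the per-attempt deadline fires
			return ctx.Err()
		}
		return nil
	}
	return connectIgnoringTimeouts(context.Background(), fake, 10*time.Millisecond)
}

func main() {
	attempts, err := simulate(3)
	fmt.Println(attempts, err) // prints "4 <nil>"
}
```

Each iteration starts a new attempt rather than waiting on the old one. Whether that is enough to avoid the stale file descriptor problem from kubernetes-csi/csi-lib-utils#131 is exactly the open question raised above, so treat this as an illustration of the control flow only.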