HTTP/2 liveness probe #1580
Comments
From a liveness perspective, we should record why TCP's keep alive probes aren't sufficient for knowing if a connection is still usable, since pinging in HTTP2 adds both complexity and congestion.
IIUC these test different things: TCP keepalive tests that the operating system is running, whereas H2 PINGs test that an instance is responsive. Furthermore, pings provide a means for an application to know the RTT of a connection, whereas TCP keepalives do not.
If a process has crashed, the OS should send us a FIN or RST. If something gets in the way, like the OS crashing, or the network being disabled, the keep alive probes should detect it eventually. For RTT, it's true that we can calculate that in-process using HTTP2
I'm less concerned about when a process has crashed entirely -- the OS can help in that case -- and more concerned about the receiver process being in a state where it isn't processing I/O (for example, a service that is stuck in a bad GC state).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
This can also happen when a node is under very high load and the application becomes unresponsive even though the host OS is still responding to TCP keepalives, as described in #3854. This can be simulated by using
Hyper supports this; we should enable it: https://github.com/hyperium/hyper/blob/master/src/server/mod.rs#L400-L409
I don't think the metrics part of the initial issue description is particularly critical -- it's more important to get keepalives working.
@olix0r do you imagine we'd add a new env var for configuring keepalive PINGs, or just reuse one of the existing ones (i imagine either
@hawkw i'd be inclined to use the existing keepalive config
This branch enables HTTP/2 PING frames in the proxy's HTTP/2 clients and servers. The timeout for responding to a PING frame is configured based on the `{INBOUND, OUTBOUND}_{CONNECT, ACCEPT}_KEEPALIVE` env variables, and the interval between PING frames is currently 1/4th of the timeout. I'm happy to change that if anyone has better ideas. Collecting metrics related to H2 PINGs probably requires support in Hyper that doesn't currently exist, so this PR doesn't add that. We can implement metrics in a follow-up, as it's lower priority. Closes linkerd/linkerd2#1580
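For illustration, the relationship between the keepalive timeout and the PING interval described above could be sketched as follows. This is a minimal sketch, not the code from the branch: the `H2PingSettings` type and the `h2_ping_settings` function are hypothetical names, and only the 1/4 ratio comes from the description.

```rust
use std::time::Duration;

/// Hypothetical settings for HTTP/2 keepalive PINGs (names are illustrative,
/// not taken from the proxy).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct H2PingSettings {
    /// How long to wait for a PING ack before treating the connection as dead.
    pub timeout: Duration,
    /// How often to send a PING frame.
    pub interval: Duration,
}

/// Derives PING settings from an optional keepalive duration, using the
/// "interval = timeout / 4" ratio from the description.
/// Returns `None` when no keepalive is configured (assumed to mean "no PINGs").
pub fn h2_ping_settings(keepalive: Option<Duration>) -> Option<H2PingSettings> {
    keepalive.map(|timeout| H2PingSettings {
        timeout,
        interval: timeout / 4,
    })
}
```

With a 10 second keepalive, this yields a PING every 2.5 seconds and a 10 second ack timeout. Values like these would then be handed to whatever keepalive settings the HTTP/2 library exposes.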
When the proxy is not actively communicating with an endpoint, for instance because there isn't enough load in the system to send requests to all endpoints in a load balancer, the proxy's view of liveness can become stale (since liveness is informed by trying to use a service).
HTTP/2 `PING` messages can be used to determine if an endpoint's networking stack is alive. The proxy should ping idle endpoints to test liveness such that the endpoint fails or becomes not ready as appropriate, ultimately so that the load balancer does not consider endpoints that do not respond to ping. Furthermore, these PINGs should be exposed outside of the h2 client so that we can, for instance, increment a counter tracking pings/latency.
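As a rough sketch of the bookkeeping this implies, the snippet below tracks one outstanding PING per endpoint, reports the endpoint as not ready once the ack is overdue, and keeps counters and a last-RTT value that could be exported as metrics. The `PingLiveness` type and its methods are hypothetical and assume at most one PING in flight; this is not the proxy's or the h2 client's actual API.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-endpoint PING bookkeeping (illustrative only).
#[derive(Debug)]
pub struct PingLiveness {
    /// How long an unacknowledged PING may be outstanding.
    timeout: Duration,
    /// When the currently outstanding PING (if any) was sent.
    pending_since: Option<Instant>,
    /// Counters that could back pings/latency metrics.
    pub pings_sent: u64,
    pub acks_received: u64,
    /// Round-trip time of the most recently acknowledged PING.
    pub last_rtt: Option<Duration>,
}

impl PingLiveness {
    pub fn new(timeout: Duration) -> Self {
        Self {
            timeout,
            pending_since: None,
            pings_sent: 0,
            acks_received: 0,
            last_rtt: None,
        }
    }

    /// Records that a PING was sent at `now`.
    pub fn on_ping_sent(&mut self, now: Instant) {
        // Keep the earliest unacknowledged send time so the timeout is
        // measured from the first PING that has not been answered.
        if self.pending_since.is_none() {
            self.pending_since = Some(now);
        }
        self.pings_sent += 1;
    }

    /// Records a PING ack received at `now` and updates the RTT.
    pub fn on_ack(&mut self, now: Instant) {
        if let Some(sent) = self.pending_since.take() {
            self.last_rtt = Some(now.saturating_duration_since(sent));
            self.acks_received += 1;
        }
    }

    /// The endpoint is ready unless a PING has gone unanswered for at
    /// least `timeout`; a load balancer could skip endpoints that are not.
    pub fn is_ready(&self, now: Instant) -> bool {
        match self.pending_since {
            Some(sent) => now.saturating_duration_since(sent) < self.timeout,
            None => true,
        }
    }
}
```

Passing `now` in explicitly keeps the logic independent of a particular timer or runtime, which also makes the timeout behavior easy to exercise without sleeping.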