
HTTP/2 liveness probe #1580

Closed · olix0r opened this issue Sep 4, 2018 · 13 comments · Fixed by linkerd/linkerd2-proxy#737

@olix0r (Member) commented Sep 4, 2018

When the proxy is not actively communicating with an endpoint, for instance because there isn't enough load in the system to send requests to all endpoints in a load balancer, the proxy's view of liveness can become stale (since liveness is informed by trying to use a service).

HTTP/2 PING messages can be used to determine whether an endpoint's networking stack is alive. The proxy should ping idle endpoints to test liveness, so that an unresponsive endpoint fails or becomes not-ready as appropriate and, ultimately, so that the load balancer does not consider endpoints that fail to respond to pings.

Furthermore, these PINGs should be exposed outside of the h2 client so that we can, for instance, increment a counter tracking pings/latency.
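
For context on what enabling this could look like: hyper, which the proxy's HTTP stack builds on, exposes client-side HTTP/2 keepalive settings. A minimal sketch assuming hyper 0.14 with the `http2` and `runtime` features; the durations are illustrative, not the proxy's actual values:

```rust
use std::time::Duration;

use hyper::Client;

// Minimal sketch: enable HTTP/2 keepalive PINGs on a hyper client so that
// idle connections are actively probed. Durations are illustrative only.
fn h2_client() -> Client<hyper::client::HttpConnector> {
    Client::builder()
        .http2_only(true)
        // Send a PING frame after 10s of connection idleness...
        .http2_keep_alive_interval(Duration::from_secs(10))
        // ...and treat the connection as dead if the PING isn't
        // acknowledged within 20s.
        .http2_keep_alive_timeout(Duration::from_secs(20))
        // Keep probing even when no streams are open -- exactly the
        // "idle endpoint" case described above.
        .http2_keep_alive_while_idle(true)
        .build_http()
}
```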

@seanmonstar (Contributor) commented:

From a liveness perspective, we should record why TCP's keepalive probes aren't sufficient for knowing whether a connection is still usable, since pinging in HTTP/2 adds both complexity and congestion.

@olix0r (Member, Author) commented Sep 4, 2018

IIUC these test different things: TCP keepalive tests that the operating system is running, whereas H2 PINGs test that an instance is responsive. Furthermore, pings provide a means for an application to know the RTT of a connection, whereas TCP keepalives do not.
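
To make the contrast concrete, here is roughly what the TCP-level mechanism looks like, sketched with the `socket2` crate (the durations are illustrative): the peer's kernel answers these probes itself, so they say nothing about whether the application behind the socket is making progress.

```rust
use std::{net::TcpStream, time::Duration};

use socket2::{SockRef, TcpKeepalive};

// Sketch: OS-level TCP keepalive. The remote kernel answers these probes,
// so they detect a dead host or network path, but not a wedged process.
fn enable_tcp_keepalive(stream: &TcpStream) -> std::io::Result<()> {
    let keepalive = TcpKeepalive::new()
        // Idle time before the first probe is sent.
        .with_time(Duration::from_secs(10))
        // Gap between subsequent probes (Linux and most Unixes).
        .with_interval(Duration::from_secs(5));
    SockRef::from(stream).set_tcp_keepalive(&keepalive)
}
```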

@seanmonstar (Contributor) commented:

If a process has crashed, the OS should send us a FIN or RST. If something gets in the way, such as the OS crashing or the network being disabled, the keepalive probes should detect it eventually.

For RTT, it's true that we can calculate that in-process using HTTP/2 PINGs, though if we're running on Linux, we can check the TCP_INFO socket option to get that value from the OS's TCP stats.
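
For illustration, reading the kernel's smoothed RTT estimate via TCP_INFO looks roughly like this with the `libc` crate (Linux-only; error handling pared down to a sketch):

```rust
use std::{mem, net::TcpStream, os::unix::io::AsRawFd};

// Sketch (Linux-only): read the kernel's smoothed RTT estimate, in
// microseconds, from a connected TCP socket via the TCP_INFO sockopt.
fn tcp_rtt_micros(stream: &TcpStream) -> std::io::Result<u32> {
    let mut info: libc::tcp_info = unsafe { mem::zeroed() };
    let mut len = mem::size_of::<libc::tcp_info>() as libc::socklen_t;
    let rc = unsafe {
        libc::getsockopt(
            stream.as_raw_fd(),
            libc::IPPROTO_TCP,
            libc::TCP_INFO,
            &mut info as *mut _ as *mut libc::c_void,
            &mut len,
        )
    };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(info.tcpi_rtt) // smoothed RTT, microseconds
}
```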

@olix0r (Member, Author) commented Sep 5, 2018

I'm less concerned about when a process has crashed entirely -- the OS can help in that case -- and more concerned about the receiver process being in a state where it isn't processing I/O (for example, a service that is stuck in a bad GC state).

@seanmonstar seanmonstar removed their assignment Oct 30, 2018
@stale stale bot commented Jan 29, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 29, 2019
@stale stale bot closed this as completed Feb 12, 2019
@adleong adleong reopened this Jan 6, 2020
@stale stale bot removed the wontfix label Jan 6, 2020
@adleong (Member) commented Jan 6, 2020

This can also happen when a node is under very high load and the application becomes unresponsive even though the host OS is still responding to TCP keepalives, as described in #3854.

This can be simulated by using kind and pausing a node container.

@stale stale bot posted the same automated stale notice Apr 6, 2020

@stale stale bot added the wontfix label Apr 6, 2020
@stale stale bot closed this as completed Apr 20, 2020
@olix0r olix0r reopened this Apr 23, 2020
@stale stale bot removed the wontfix label Apr 23, 2020
@stale stale bot posted the same automated stale notice Jul 23, 2020

@stale stale bot added the wontfix label Jul 23, 2020
@ihcsim ihcsim removed the wontfix label Jul 23, 2020
@stale stale bot posted the same automated stale notice Oct 22, 2020

@stale stale bot added the wontfix label Oct 22, 2020
@stale stale bot closed this as completed Nov 7, 2020
@olix0r (Member, Author) commented Nov 12, 2020

Hyper supports this; we should enable it: https://github.com/hyperium/hyper/blob/master/src/server/mod.rs#L400-L409
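
For reference, the server-side knobs at that link look roughly like this (a sketch assuming hyper 0.14 and tokio; the address and durations are illustrative):

```rust
use std::{convert::Infallible, time::Duration};

use hyper::{service::{make_service_fn, service_fn}, Body, Response, Server};

// Sketch: enable HTTP/2 keepalive PINGs on a hyper server.
#[tokio::main]
async fn main() -> Result<(), hyper::Error> {
    let make_svc = make_service_fn(|_conn| async {
        Ok::<_, Infallible>(service_fn(|_req| async {
            Ok::<_, Infallible>(Response::new(Body::from("ok")))
        }))
    });

    Server::bind(&([127, 0, 0, 1], 8080).into())
        // PING clients whose connections have been idle for 10s...
        .http2_keep_alive_interval(Duration::from_secs(10))
        // ...and close the connection if no ACK arrives within 20s.
        .http2_keep_alive_timeout(Duration::from_secs(20))
        .serve(make_svc)
        .await
}
```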

@olix0r olix0r reopened this Nov 12, 2020
@stale stale bot removed the wontfix label Nov 12, 2020
@olix0r olix0r added this to To do in 2.10 - backlog via automation Nov 12, 2020
@olix0r olix0r added the priority/P0 Release Blocker label Nov 12, 2020
@olix0r (Member, Author) commented Nov 12, 2020

I don't think the metrics part of the initial issue description is particularly critical -- it's more important to get keepalives working.

@hawkw (Member) commented Nov 12, 2020

@olix0r do you imagine we'd add a new env var for configuring keepalive PINGs, or just reuse one of the existing ones? I imagine either {INBOUND, OUTBOUND}_{CONNECT, ACCEPT}_KEEPALIVE (which configures TCP keepalive) or {INBOUND, OUTBOUND}_MAX_IDLE_AGE.

@olix0r (Member, Author) commented Nov 12, 2020

@hawkw I'd be inclined to use the existing keepalive config.

2.10 - backlog automation moved this from To do to Done Nov 12, 2020
olix0r pushed a commit to linkerd/linkerd2-proxy that referenced this issue Nov 12, 2020
This branch enables HTTP/2 PING frames in the proxy's HTTP/2 clients and
servers. The timeout for responding to a PING frame is configured based
on the `{INBOUND, OUTBOUND}_{CONNECT, ACCEPT}_KEEPALIVE` env variables,
and the interval between PING frames is currently 1/4th of the timeout.
I'm happy to change that if anyone has better ideas.

Collecting metrics related to H2 PINGs probably requires support in 
Hyper that doesn't currently exist, so this PR doesn't add that. We 
can implement metrics in a follow-up, as it's lower priority.

Closes linkerd/linkerd2#1580
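
In other words, the scheme the commit describes boils down to the following (a hypothetical helper; the names are illustrative, not the proxy's actual configuration code):

```rust
use std::time::Duration;

// Hypothetical helper mirroring the scheme described above: the PING
// response timeout reuses the existing keepalive setting, and PINGs are
// sent at a quarter of that period, so an idle connection is probed
// several times within one timeout window.
fn ping_params(keepalive: Duration) -> (Duration, Duration) {
    let timeout = keepalive;    // how long to wait for a PING ACK
    let interval = timeout / 4; // how often to PING an idle connection
    (interval, timeout)
}
```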
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021