
HTTP/2 liveness probe #1580

Closed · olix0r opened this issue Sep 4, 2018 · 13 comments · Fixed by linkerd/linkerd2-proxy#737

@olix0r (Member) commented Sep 4, 2018

When the proxy is not actively communicating with an endpoint, for instance because there isn't enough load in the system to send requests to all endpoints in a load balancer, the proxy's view of liveness can become stale (since liveness is informed by trying to use a service).

HTTP/2 PING messages can be used to determine whether an endpoint's networking stack is alive. The proxy should ping idle endpoints to test liveness, so that an unresponsive endpoint fails or becomes not-ready as appropriate and, ultimately, so that the load balancer does not consider endpoints that fail to respond to pings.

Furthermore, these PINGs should be exposed outside of the h2 client so that we can, for instance, increment a counter tracking pings/latency.
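
For context on what enabling this could look like: hyper, which the proxy's HTTP stack builds on, exposes client-side HTTP/2 keepalive settings. A minimal sketch assuming hyper 0.14 with the `http2` and `runtime` features; the durations are illustrative, not the proxy's actual values:

```rust
use std::time::Duration;

use hyper::Client;

// Minimal sketch: enable HTTP/2 keepalive PINGs on a hyper client so that
// idle connections are actively probed. Durations are illustrative only.
fn h2_client() -> Client<hyper::client::HttpConnector> {
    Client::builder()
        .http2_only(true)
        // Send a PING frame after 10s of connection idleness...
        .http2_keep_alive_interval(Duration::from_secs(10))
        // ...and treat the connection as dead if the PING isn't
        // acknowledged within 20s.
        .http2_keep_alive_timeout(Duration::from_secs(20))
        // Keep probing even when no streams are open -- exactly the
        // "idle endpoint" case described above.
        .http2_keep_alive_while_idle(true)
        .build_http()
}
```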

@seanmonstar (Contributor) commented:

From a liveness perspective, we should record why TCP's keepalive probes aren't sufficient for knowing whether a connection is still usable, since pinging in HTTP/2 adds both complexity and congestion.

@olix0r (Member, Author) commented Sep 4, 2018

IIUC these test different things: TCP keepalive tests that the operating system is running, whereas H2 PINGs test that an instance is responsive. Furthermore, pings provide a means for an application to know the RTT of a connection, whereas TCP keepalives do not.
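
To make the contrast concrete, here is roughly what the TCP-level mechanism looks like, sketched with the `socket2` crate (the durations are illustrative): the peer's kernel answers these probes itself, so they say nothing about whether the application behind the socket is making progress.

```rust
use std::{net::TcpStream, time::Duration};

use socket2::{SockRef, TcpKeepalive};

// Sketch: OS-level TCP keepalive. The remote kernel answers these probes,
// so they detect a dead host or network path, but not a wedged process.
fn enable_tcp_keepalive(stream: &TcpStream) -> std::io::Result<()> {
    let keepalive = TcpKeepalive::new()
        // Idle time before the first probe is sent.
        .with_time(Duration::from_secs(10))
        // Gap between subsequent probes (Linux and most Unixes).
        .with_interval(Duration::from_secs(5));
    SockRef::from(stream).set_tcp_keepalive(&keepalive)
}
```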

@seanmonstar (Contributor) commented:

If a process has crashed, the OS should send us a FIN or RST. If something gets in the way, such as the OS crashing or the network being disabled, the keepalive probes should detect it eventually.

For RTT, it's true that we can calculate that in-process using HTTP/2 PINGs, though if we're running on Linux, we can check the TCP_INFO socket option to get that value from the OS's TCP stats.
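
For illustration, reading the kernel's smoothed RTT estimate via TCP_INFO looks roughly like this with the `libc` crate (Linux-only; error handling pared down to a sketch):

```rust
use std::{mem, net::TcpStream, os::unix::io::AsRawFd};

// Sketch (Linux-only): read the kernel's smoothed RTT estimate, in
// microseconds, from a connected TCP socket via the TCP_INFO sockopt.
fn tcp_rtt_micros(stream: &TcpStream) -> std::io::Result<u32> {
    let mut info: libc::tcp_info = unsafe { mem::zeroed() };
    let mut len = mem::size_of::<libc::tcp_info>() as libc::socklen_t;
    let rc = unsafe {
        libc::getsockopt(
            stream.as_raw_fd(),
            libc::IPPROTO_TCP,
            libc::TCP_INFO,
            &mut info as *mut _ as *mut libc::c_void,
            &mut len,
        )
    };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(info.tcpi_rtt) // smoothed RTT, microseconds
}
```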

@olix0r (Member, Author) commented Sep 5, 2018

I'm less concerned about when a process has crashed entirely -- the OS can help in that case -- and more concerned about the receiver process being in a state where it isn't processing I/O (for example, a service that is stuck in a bad GC state).

@seanmonstar seanmonstar removed their assignment Oct 30, 2018
@stale stale bot commented Jan 29, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 29, 2019
@stale stale bot closed this as completed Feb 12, 2019
@adleong adleong reopened this Jan 6, 2020
@stale stale bot removed the wontfix label Jan 6, 2020
@adleong (Member) commented Jan 6, 2020

This can also happen when a node is under very high load and the application becomes unresponsive even though the host OS is still responding to TCP keepalives, as described in #3854.

This can be simulated by using kind and pausing a node container.

@stale stale bot posted the same automated stale notice Apr 6, 2020

@stale stale bot added the wontfix label Apr 6, 2020
@stale stale bot closed this as completed Apr 20, 2020
@olix0r olix0r reopened this Apr 23, 2020
@stale stale bot removed the wontfix label Apr 23, 2020
@stale stale bot posted the same automated stale notice Jul 23, 2020

@stale stale bot added the wontfix label Jul 23, 2020
@ihcsim ihcsim removed the wontfix label Jul 23, 2020
@stale stale bot posted the same automated stale notice Oct 22, 2020

@stale stale bot added the wontfix label Oct 22, 2020
@stale stale bot closed this as completed Nov 7, 2020
@olix0r (Member, Author) commented Nov 12, 2020

Hyper supports this; we should enable it: https://github.com/hyperium/hyper/blob/master/src/server/mod.rs#L400-L409
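
For reference, the server-side knobs at that link look roughly like this (a sketch assuming hyper 0.14 and tokio; the address and durations are illustrative):

```rust
use std::{convert::Infallible, time::Duration};

use hyper::{service::{make_service_fn, service_fn}, Body, Response, Server};

// Sketch: enable HTTP/2 keepalive PINGs on a hyper server.
#[tokio::main]
async fn main() -> Result<(), hyper::Error> {
    let make_svc = make_service_fn(|_conn| async {
        Ok::<_, Infallible>(service_fn(|_req| async {
            Ok::<_, Infallible>(Response::new(Body::from("ok")))
        }))
    });

    Server::bind(&([127, 0, 0, 1], 8080).into())
        // PING clients whose connections have been idle for 10s...
        .http2_keep_alive_interval(Duration::from_secs(10))
        // ...and close the connection if no ACK arrives within 20s.
        .http2_keep_alive_timeout(Duration::from_secs(20))
        .serve(make_svc)
        .await
}
```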

@olix0r olix0r reopened this Nov 12, 2020
@stale stale bot removed the wontfix label Nov 12, 2020
@olix0r olix0r added this to To do in 2.10 - backlog via automation Nov 12, 2020
@olix0r olix0r added the priority/P0 Release Blocker label Nov 12, 2020
@olix0r (Member, Author) commented Nov 12, 2020

I don't think the metrics part of the initial issue description is particularly critical -- it's more important to get keepalives working.

@hawkw (Member) commented Nov 12, 2020

@olix0r do you imagine we'd add a new env var for configuring keepalive PINGs, or just reuse one of the existing ones? I imagine either {INBOUND, OUTBOUND}_{CONNECT, ACCEPT}_KEEPALIVE (which configures TCP keepalive) or {INBOUND, OUTBOUND}_MAX_IDLE_AGE.

@olix0r (Member, Author) commented Nov 12, 2020

@hawkw I'd be inclined to use the existing keepalive config.

2.10 - backlog automation moved this from To do to Done Nov 12, 2020
olix0r pushed a commit to linkerd/linkerd2-proxy that referenced this issue Nov 12, 2020
This branch enables HTTP/2 PING frames in the proxy's HTTP/2 clients and
servers. The timeout for responding to a PING frame is configured based
on the `{INBOUND, OUTBOUND}_{CONNECT, ACCEPT}_KEEPALIVE` env variables,
and the interval between PING frames is currently 1/4th of the timeout.
I'm happy to change that if anyone has better ideas.

Collecting metrics related to H2 PINGs probably requires support in 
Hyper that doesn't currently exist, so this PR doesn't add that. We 
can implement metrics in a follow-up, as it's lower priority.

Closes linkerd/linkerd2#1580
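
In other words, the scheme the commit describes boils down to the following (a hypothetical helper; the names are illustrative, not the proxy's actual configuration code):

```rust
use std::time::Duration;

// Hypothetical helper mirroring the scheme described above: the PING
// response timeout reuses the existing keepalive setting, and PINGs are
// sent at a quarter of that period, so an idle connection is probed
// several times within one timeout window.
fn ping_params(keepalive: Duration) -> (Duration, Duration) {
    let timeout = keepalive;    // how long to wait for a PING ACK
    let interval = timeout / 4; // how often to PING an idle connection
    (interval, timeout)
}
```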
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021