Fix: handle connection state changes #354

mem · 2022-11-02T22:20:56Z

While trying to figure out why we are seeing a GOAWAY HTTP2 frame coming from the "API" (actually nginx), Adrian found out that the only way to handle that is by watching the connection state changes.

Add another goroutine that is in charge of monitoring the state changes. If the connection goes to IDLE or SHUTDOWN, report an error from the goroutine so that it breaks out of the loop. This causes the agent to reconnect.

Our best understanding of the issue is that nginx has a hard limit on the number of requests that can be made over a single connection, and when it reaches that limit, it sends a GOAWAY frame. The pings that we send to keep the connection alive make this worse, because we send one every 90 seconds and the current limit in nginx is 100. That gives us 100 × 90 = 9000 seconds = 2.5 hours. This is an upper limit, since it does not count the additional requests that are part of the regular communication between the API and the agent, and it seems to match observed behavior.

See https://github.com/grafana/deployment_tools/pull/47478

Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>


This reverts (most of) grafana#354, as we figured out a way to solve the connectivity issues with configuration changes in NGINX instead of needing this additional logic. Keeps the updated dependencies and additional logs. Signed-off-by: Adrian Serrano <adrisr83@gmail.com>


mem requested a review from a team as a code owner November 2, 2022 22:20

adriansr approved these changes Nov 3, 2022


mem merged commit cb2fc30 into main Nov 3, 2022

mem deleted the handle_connection_state_change branch November 3, 2022 13:49

adriansr mentioned this pull request Nov 9, 2022

Revert: handle connection state changes #366

Merged
