Fix: handle connection state changes #354

Merged: mem merged 1 commit into main from handle_connection_state_change on Nov 3, 2022

Contributor

@mem mem commented Nov 2, 2022

While trying to figure out why we are seeing an HTTP/2 GOAWAY frame coming from the "API" (actually nginx), Adrian found out that the only way to handle it is to watch the connection's state changes.

Add another goroutine that is in charge of monitoring the state changes. If the connection goes to IDLE or SHUTDOWN, report an error from the goroutine so that it breaks out of the loop. This causes the agent to reconnect.
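The monitoring step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `State`, `stateWatcher`, `watchState`, `scriptedConn` and `demo` are hypothetical names, and the interface only mirrors the shape of `GetState` and `WaitForStateChange` on gRPC-Go's `ClientConn` so that the sketch builds with the standard library alone. It also assumes monitoring starts once the connection is already up, since a freshly created connection may itself report IDLE.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// State is a local stand-in for the connectivity states named in the
// description (IDLE, SHUTDOWN, ...). Assumption: not the real gRPC type.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
	Shutdown
)

// stateWatcher is a hypothetical interface shaped like the GetState and
// WaitForStateChange methods of a gRPC client connection.
type stateWatcher interface {
	GetState() State
	WaitForStateChange(ctx context.Context, source State) bool
}

var errConnDown = errors.New("connection is idle or shut down")

// watchState blocks until the connection reaches Idle or Shutdown and then
// returns an error, so the caller can break out of its loop and reconnect.
// If ctx is cancelled first, it returns ctx.Err().
func watchState(ctx context.Context, conn stateWatcher) error {
	for {
		s := conn.GetState()
		if s == Idle || s == Shutdown {
			return fmt.Errorf("%w (state %d)", errConnDown, s)
		}
		// Wait until the state differs from s; false means ctx is done.
		if !conn.WaitForStateChange(ctx, s) {
			return ctx.Err()
		}
	}
}

// scriptedConn is a test double that replays a fixed sequence of states.
type scriptedConn struct {
	states []State
	pos    int
}

func (c *scriptedConn) GetState() State { return c.states[c.pos] }

func (c *scriptedConn) WaitForStateChange(ctx context.Context, source State) bool {
	if c.pos+1 < len(c.states) {
		c.pos++
		return true
	}
	<-ctx.Done() // no further changes: block until cancelled
	return false
}

// demo runs the watcher against a scripted sequence and returns its error.
func demo(states ...State) error {
	return watchState(context.Background(), &scriptedConn{states: states})
}
```

In an agent, something like `watchState` would run in its own goroutine next to the main loop, and a non-nil return would be treated as the signal to tear down the connection and reconnect.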

Our best understanding of the issue is that nginx has a hard limit on the number of requests that can be made over a single connection, and when a connection reaches that limit, nginx sends a GOAWAY frame. The pings that we send to keep the connection alive make this worse: we send one every 90 seconds and the current limit in nginx is 100, so pings alone exhaust the limit after 90 s × 100 = 9000 s = 2.5 hours. The regular communication between the API and the agent adds further requests, so 2.5 hours is the upper limit on the connection's lifetime, which seems to match observed behavior.
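The upper bound in the paragraph above can be checked with a few lines of Go. The 90-second interval and the limit of 100 come from the description; the names `pingInterval`, `maxRequests`, `lifetime` and `pingOnlyUpperBound` are illustrative only.

```go
package main

import "time"

const (
	pingInterval = 90 * time.Second // keep-alive ping period from the description
	maxRequests  = 100              // per-connection request limit from the description
)

// lifetime returns how long a connection lasts if one request is sent
// every interval and the server closes it after limit requests.
func lifetime(interval time.Duration, limit int) time.Duration {
	return interval * time.Duration(limit)
}

// pingOnlyUpperBound is the longest possible connection lifetime, reached
// only when keep-alive pings are the sole requests on the connection.
func pingOnlyUpperBound() time.Duration {
	return lifetime(pingInterval, maxRequests)
}
```

`pingOnlyUpperBound().String()` yields `2h30m0s`; any regular API traffic consumes requests too and shortens the real lifetime.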

See https://github.com/grafana/deployment_tools/pull/47478

Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>

@mem mem requested a review from a team as a code owner November 2, 2022 22:20
@mem mem merged commit cb2fc30 into main Nov 3, 2022
@mem mem deleted the handle_connection_state_change branch November 3, 2022 13:49
adriansr added a commit to adriansr/synthetic-monitoring-agent that referenced this pull request Nov 9, 2022
This reverts (most of) grafana#354, as we figured out a way to solve the
connectivity issues with configuration changes in NGINX instead of
needing this additional logic.

Keeps the updated dependencies and additional logs.

Signed-off-by: Adrian Serrano <adrisr83@gmail.com>
adriansr added a commit that referenced this pull request Nov 9, 2022
This reverts (most of) #354, as we figured out a way to solve the
connectivity issues with configuration changes in NGINX instead of
needing this additional logic.

Keeps the updated dependencies and additional logs.

Signed-off-by: Adrian Serrano <adrisr83@gmail.com>