integrations/access: Make the plugins exit when the connection breaks instead of retrying infinitely and hanging #30039
Conversation
case trace.IsConnectionProblem(err):
	log.WithError(err).Error("Failed to connect to Teleport Auth server. Reconnecting...")
	// Not all connection problems can be retried. The client can
	// end up in a broken state and won't be able to connect.
	// Exiting in error is noisier but allows the orchestrator to
	// know something is not right.
	return trace.WrapWithMessage(err, "Failed to connect to Teleport server. Exiting.")
Adding this return can break scenarios where an orchestrator is not watching the process. Can we keep the previous logging message and rely on the backoff.Do to prevent this from ever reaching the "broken state"?
The backoff.Do won't recover; it will only surface errors every X seconds. To recover from such a broken state we would have to tear down the whole client, and I don't think we can do that without leaking memory and goroutines all over the place.
One mitigation would be to assign a retry quota, e.g. allow 3 reconnection attempts within 5 minutes and crash if we exceed it. But this approach is worse for users running under an orchestrator, as the plugin won't fail fast.
Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.
Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.
Seems like a good idea 👍
I implemented the feature flag in f56a732
This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors.
@hugoShaka See the table below for backport results.
… instead of retrying infinitely and hanging (#30039) * integrations/access: avoid infinite retry on broken connection This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors. * Add TELEPORT_PLUGIN_FAIL_FAST environment variable * fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable
… instead of retrying infinitely and hanging (#30039) (#30431) * integrations/access: avoid infinite retry on broken connection This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors. * Add TELEPORT_PLUGIN_FAIL_FAST environment variable * fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable
Fixes gravitational/teleport-plugins#871
This commit changes the watcherjob retry behaviour when
TELEPORT_PLUGIN_FAIL_FAST
is set. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors.
Note for reviewers: this is a potentially disruptive change, as we will now exit in error in cases that might have been retriable; this was gated behind a flag.