Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integrations/access: Make the plugins exit when the connection breaks instead of retrying infinetly and hanging #30039

Merged
merged 4 commits into from
Aug 9, 2023

Conversation

hugoShaka
Copy link
Contributor

@hugoShaka hugoShaka commented Aug 4, 2023

Fixes gravitational/teleport-plugins#871

This commit changes the watcherjob retry behaviour when TELEPORT_PLUGIN_FAIL_FAST is set. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors.

Note for reviewers:

  • this is a potentially disrupting change as we will now exit in error in cases that might have been retriable this was gated behind a flag
  • Testing this is hard I managed to reproduce by messing with network through iptables rules between my teleport plugin and its cluster.

@hugoShaka hugoShaka added backport/branch/v11 backport/branch/v13 bug teleport-plugin Tickets related to Teleport Plugins https://github.com/gravitational/teleport-plugins access-requests labels Aug 4, 2023
@hugoShaka hugoShaka changed the title integrations/access: avoid infinite retry on broken connection integrations/access: Make the plugins exit when the connection breaks insetad of retrying infinetly and hanging Aug 4, 2023
@hugoShaka hugoShaka changed the title integrations/access: Make the plugins exit when the connection breaks insetad of retrying infinetly and hanging integrations/access: Make the plugins exit when the connection breaks instead of retrying infinetly and hanging Aug 7, 2023
integrations/lib/watcherjob/watcherjob.go Show resolved Hide resolved
Comment on lines 95 to 100
case trace.IsConnectionProblem(err):
log.WithError(err).Error("Failed to connect to Teleport Auth server. Reconnecting...")
// Not all connection problems can be retried. The client can
// end up in a broken state and won't be able to connect.
// Exiting in error is noisier but allows the orchestrator to
// know something is not right.
return trace.WrapWithMessage(err, "Failed to connect to Teleport server. Exiting.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this return can break scenarios where an orchestrator is not watching the process.

Can we keep the previous logging message and rely on the backoff.Do to prevent this from ever reaching the "broken state"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backoff.Do won't recover, it will only throw errors explicitly every X seconds. To recover from such broken state we have to teardown the whole client. I don't think we can do it without leaking memory and goroutines all around the place.

What we could do to mitigate is assign a retry quota, like allow 3 reconnection attempts in 5 minutes, and crash if we exceed it. But this approach is worse for users using an orchestrator as the plugin won't fail fast.

Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.

Seems like a good idea 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented the feature flag in f56a732

This commit changes the watcherjob retry behaviour. Instead of relying
on gRPC's retry mechanism, the plugins will now fail fast when something
happens to the connection. This means the plugin will exit in error more
often, but it won't be stuck in a retry loop, silently smallowing
connection errors.
@hugoShaka hugoShaka force-pushed the hugo/fix-plugin-infinite-retry branch from 5965bbe to 84b050e Compare August 8, 2023 21:46
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from r0mant August 9, 2023 06:42
@hugoShaka hugoShaka added this pull request to the merge queue Aug 9, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 9, 2023
@hugoShaka hugoShaka added this pull request to the merge queue Aug 9, 2023
Merged via the queue into master with commit 7a4bdea Aug 9, 2023
21 checks passed
@hugoShaka hugoShaka deleted the hugo/fix-plugin-infinite-retry branch August 9, 2023 19:20
@public-teleport-github-review-bot

@hugoShaka See the table below for backport results.

Branch Result
branch/v11 Failed
branch/v12 Failed
branch/v13 Failed

hugoShaka added a commit that referenced this pull request Aug 14, 2023
… instead of retrying infinetly and hanging (#30039)

* integrations/access: avoid infinite retry on broken connection

This commit changes the watcherjob retry behaviour. Instead of relying
on gRPC's retry mechanism, the plugins will now fail fast when something
happens to the connection. This means the plugin will exit in error more
often, but it won't be stuck in a retry loop, silently smallowing
connection errors.

* Add TELEPORT_PLUGIN_FAIL_FAST environment variable

* fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable
github-merge-queue bot pushed a commit that referenced this pull request Aug 14, 2023
… instead of retrying infinetly and hanging (#30039) (#30431)

* integrations/access: avoid infinite retry on broken connection

This commit changes the watcherjob retry behaviour. Instead of relying
on gRPC's retry mechanism, the plugins will now fail fast when something
happens to the connection. This means the plugin will exit in error more
often, but it won't be stuck in a retry loop, silently smallowing
connection errors.

* Add TELEPORT_PLUGIN_FAIL_FAST environment variable

* fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable
@fheinecke fheinecke mentioned this pull request Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
access-requests backport/branch/v13 bug size/sm teleport-plugin Tickets related to Teleport Plugins https://github.com/gravitational/teleport-plugins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slack Plugin Errors with Interrupted Workflow
3 participants