integrations/access: Make the plugins exit when the connection breaks instead of retrying infinetly and hanging #30039

hugoShaka · 2023-08-04T14:16:32Z

Fixes gravitational/teleport-plugins#871

This commit changes the watcherjob retry behaviour when TELEPORT_PLUGIN_FAIL_FAST is set. Instead of relying on gRPC's retry mechanism, the plugins now fail fast when something happens to the connection. The plugin will exit with an error more often, but it will no longer get stuck in a retry loop that silently swallows connection errors.

Note for reviewers:

- ~~This is a potentially disruptive change, as we will now exit with an error in cases that might have been retriable.~~ This is now gated behind a flag.
- Testing this is hard. I managed to reproduce the issue by disrupting the network between my Teleport plugin and its cluster with iptables rules.

marcoandredinis · 2023-08-07T10:28:01Z

integrations/lib/watcherjob/watcherjob.go

 			case trace.IsConnectionProblem(err):
-				log.WithError(err).Error("Failed to connect to Teleport Auth server. Reconnecting...")
+				// Not all connection problems can be retried. The client can
+				// end up in a broken state and won't be able to connect.
+				// Exiting in error is noisier but allows the orchestrator to
+				// know something is not right.
+				return trace.WrapWithMessage(err, "Failed to connect to Teleport server. Exiting.")


Adding this return can break scenarios where an orchestrator is not watching the process.

Can we keep the previous logging message and rely on the backoff.Do to prevent this from ever reaching the "broken state"?

backoff.Do won't recover from this; it will only surface the error explicitly every X seconds. To recover from such a broken state, we would have to tear down the whole client, and I don't think we can do that without leaking memory and goroutines all over the place.

As a mitigation, we could assign a retry quota, for example allow 3 reconnection attempts in 5 minutes and crash if the plugin exceeds it. But this approach is worse for users running an orchestrator, as the plugin won't fail fast.
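To make the retry quota idea concrete, here is a minimal sketch of a sliding-window quota in Go. It is not code from this PR: `retryQuota`, `record`, `recordNow`, `errQuotaExceeded`, and the two constants are hypothetical names, and the numbers simply mirror the "3 attempts in 5 minutes" example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical defaults taken from the example in the discussion.
const (
	defaultMax    = 3
	defaultWindow = 5 * time.Minute
)

// errQuotaExceeded signals that the caller should stop retrying and exit.
var errQuotaExceeded = errors.New("reconnection quota exceeded")

// retryQuota allows at most max reconnection attempts per sliding window.
// This is an illustrative helper, not part of the Teleport codebase.
type retryQuota struct {
	max      int
	window   time.Duration
	attempts []time.Time
}

// record registers an attempt made at time now. It returns errQuotaExceeded
// when more than max attempts fall inside the window ending at now.
func (q *retryQuota) record(now time.Time) error {
	cutoff := now.Add(-q.window)
	kept := q.attempts[:0]
	for _, t := range q.attempts {
		if t.After(cutoff) {
			kept = append(kept, t) // still inside the window
		}
	}
	q.attempts = append(kept, now)
	if len(q.attempts) > q.max {
		return errQuotaExceeded
	}
	return nil
}

// recordNow registers an attempt at the current time.
func (q *retryQuota) recordNow() error {
	return q.record(time.Now())
}

func main() {
	q := &retryQuota{max: defaultMax, window: defaultWindow}
	start := time.Now()
	// Simulate one connection failure per minute.
	for i := 0; i < 5; i++ {
		now := start.Add(time.Duration(i) * time.Minute)
		if err := q.record(now); err != nil {
			fmt.Printf("attempt %d: %v, exiting\n", i+1, err)
			return
		}
		fmt.Printf("attempt %d: reconnecting\n", i+1)
	}
}
```

In this simulation the first three attempts are allowed and the fourth, still inside the 5 minute window, exceeds the quota. The drawback raised above still applies: the process only exits after several failed attempts.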

Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.

> Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.

Seems like a good idea 👍

I implemented the feature flag in f56a732
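For readers who want to see the shape of such a flag, the following is a rough sketch and not the code from f56a732. It assumes that any non-empty value of TELEPORT_PLUGIN_FAIL_FAST enables the behaviour; `failFastEnabled`, `handleConnectionError`, and `errConnection` are hypothetical names, and the real plugin may parse the variable differently.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// failFastEnvVar is the variable name proposed in the discussion.
const failFastEnvVar = "TELEPORT_PLUGIN_FAIL_FAST"

// errConnection stands in for a real connection error in this sketch.
var errConnection = errors.New("connection problem")

// failFastEnabled reports whether the flag is set.
// Assumption: any non-empty value counts as "set".
// getenv is injected so the check can be exercised without touching the
// process environment; pass os.Getenv in real use.
func failFastEnabled(getenv func(string) string) bool {
	return getenv(failFastEnvVar) != ""
}

// handleConnectionError returns a wrapped error when failFast is true, so the
// caller can exit. Otherwise it logs the problem and returns nil, which keeps
// the previous behaviour of retrying.
func handleConnectionError(err error, failFast bool) error {
	if failFast {
		return fmt.Errorf("failed to connect to Teleport server, exiting: %w", err)
	}
	fmt.Fprintf(os.Stderr, "failed to connect to Teleport Auth server, reconnecting: %v\n", err)
	return nil
}

func main() {
	failFast := failFastEnabled(os.Getenv)
	if err := handleConnectionError(errConnection, failFast); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // a non-zero exit lets an orchestrator restart the plugin
	}
}
```

Because the default (variable unset) keeps the retry path, users who run the plugin without an orchestrator see no change in behaviour.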

public-teleport-github-review-bot · 2023-08-09T19:22:24Z

@hugoShaka See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v11 | Failed |
| branch/v12 | Failed |
| branch/v13 | Failed |

… instead of retrying infinetly and hanging (#30039)

* integrations/access: avoid infinite retry on broken connection

  This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit with an error more often, but it won't be stuck in a retry loop, silently swallowing connection errors.
* Add TELEPORT_PLUGIN_FAIL_FAST environment variable
* fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable

… instead of retrying infinetly and hanging (#30039) (#30431)

* integrations/access: avoid infinite retry on broken connection

  This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit with an error more often, but it won't be stuck in a retry loop, silently swallowing connection errors.
* Add TELEPORT_PLUGIN_FAIL_FAST environment variable
* fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable

hugoShaka requested review from marcoandredinis, fspmarshall and r0mant August 4, 2023 14:16

github-actions bot added the size/sm label Aug 4, 2023

hugoShaka added the backport/branch/v11, backport/branch/v13, bug, teleport-plugin, and access-requests labels Aug 4, 2023

fspmarshall approved these changes Aug 4, 2023

hugoShaka changed the title ~~integrations/access: avoid infinite retry on broken connection~~ integrations/access: Make the plugins exit when the connection breaks insetad of retrying infinetly and hanging Aug 4, 2023

hugoShaka added the changelog label Aug 4, 2023

hugoShaka changed the title ~~integrations/access: Make the plugins exit when the connection breaks insetad of retrying infinetly and hanging~~ integrations/access: Make the plugins exit when the connection breaks instead of retrying infinetly and hanging Aug 7, 2023

marcoandredinis reviewed Aug 8, 2023

hugoShaka force-pushed the hugo/fix-plugin-infinite-retry branch from 5965bbe to 84b050e Compare August 8, 2023 21:46

Add TELEPORT_PLUGIN_FAIL_FAST environment variable

f56a732

hugoShaka force-pushed the hugo/fix-plugin-infinite-retry branch from 84b050e to f56a732 Compare August 8, 2023 21:57

hugoShaka requested a review from marcoandredinis August 8, 2023 21:57

marcoandredinis approved these changes Aug 9, 2023

public-teleport-github-review-bot bot removed the request for review from r0mant August 9, 2023 06:42

fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable

f409af0

hugoShaka added this pull request to the merge queue Aug 9, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 9, 2023

Merge branch 'master' into hugo/fix-plugin-infinite-retry

e6faa5b

hugoShaka enabled auto-merge August 9, 2023 18:41

hugoShaka added this pull request to the merge queue Aug 9, 2023

Merged via the queue into master with commit 7a4bdea Aug 9, 2023
21 checks passed

hugoShaka deleted the hugo/fix-plugin-infinite-retry branch August 9, 2023 19:20

hugoShaka mentioned this pull request Aug 14, 2023

[v13] integrations/access: Make the plugins exit when the connection breaks instead of retrying infinetly and hanging #30431

Merged

ZhongRuoyu mentioned this pull request Aug 18, 2023

teleport 13.3.4 Homebrew/homebrew-core#139862

Merged

fheinecke mentioned this pull request Sep 26, 2023

Release 14.0.1 #32611

Merged
