
Improve HA behavior of database agents in leaf clusters #10641

Merged 3 commits into master from roman/leafdb on Mar 2, 2022

Conversation

@r0mant (Collaborator) commented Feb 25, 2022

Fixes #10640.

The database agent fallback logic in the proxy server did not properly recognize the reverse tunnel error when connecting to a leaf cluster agent, so the retry logic did not kick in. The reason is that when connecting to an agent in the same cluster the returned error is a trace.ConnectionProblem, but when connecting to an agent in a leaf cluster it's just a generic error. This fix makes sure we also check the error message to recognize it.

I've also written integration tests for both scenarios, HA in root and HA in leaf. I had to do some shenanigans to make the shuffle behavior overridable in tests so we don't rely on chance.
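For context, the fallback path on the proxy side looks roughly like the sketch below. This is a simplified illustration, not the verbatim Teleport code: dialAgent and getShuffleFunc are hypothetical stand-ins for the real dialing and shuffle plumbing, while isReverseTunnelDownError is the helper discussed in this PR.

// Simplified sketch: try database agents in shuffled order and fall through
// to the next one only when the reverse tunnel to the current agent is down.
func connectWithFallback(ctx context.Context, servers []types.DatabaseServer) (net.Conn, error) {
	for _, server := range getShuffleFunc()(servers) { // hypothetical shuffle accessor
		conn, err := dialAgent(ctx, server) // hypothetical dial helper
		if err != nil {
			// If an agent is down, retry on the next one (if available).
			if isReverseTunnelDownError(err) {
				continue
			}
			// Other errors (RBAC, target database failures, etc.) are not retried.
			return nil, trace.Wrap(err)
		}
		return conn, nil
	}
	return nil, trace.NotFound("no database agents available")
}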

@github-actions bot added the database-access (Database access related issues and PRs) label on Feb 25, 2022
@r0mant (Collaborator, author) commented Feb 25, 2022

@gabrielcorado Can you also check whether the HA for application access we implemented a couple of months ago is prone to the same issue and, if so, implement a fix/test similar to this one?

// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
func isReverseTunnelDownError(err error) bool {
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}
Contributor

After adding the missing case types.DatabaseTunnel to the switch r.ConnType and returning trace.ConnectionProblem(err, ...), the check based on the NoDatabaseTunnel error message text seems to be unnecessary.

Collaborator Author

I don't think it's correct - previously, we would also return trace.ConnectionProblem there (if you look a few lines below, after directDial) and it wasn't detected in the db proxy. To confirm I also tried removing the error message matching and the integration test I wrote failed with the error this PR fixes.
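To make the failure mode concrete, here is a small standalone sketch (not code from this PR; the message constant below is made up for illustration, the real text lives in the reversetunnel package): the typed check matches errors produced in the same cluster, but an error that traveled back from a leaf cluster arrives as a generic error, so only its message can be matched.

package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/gravitational/trace"
)

// noDatabaseTunnel stands in for reversetunnel.NoDatabaseTunnel.
const noDatabaseTunnel = "no database reverse tunnel found"

func main() {
	// Same-cluster case: the dial failure is a typed trace.ConnectionProblem.
	localErr := trace.ConnectionProblem(nil, noDatabaseTunnel)
	fmt.Println(trace.IsConnectionProblem(localErr)) // true

	// Leaf-cluster case: the failure comes back over the transport re-created
	// as a generic error - the type is lost, only the message text remains.
	remoteErr := errors.New("failed dialing leaf agent: " + noDatabaseTunnel)
	fmt.Println(trace.IsConnectionProblem(remoteErr))                  // false
	fmt.Println(strings.Contains(remoteErr.Error(), noDatabaseTunnel)) // true
}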

Comment on lines +378 to +384
	// Connections to applications and databases should never occur over
	// a direct dial, return right away.
	switch r.ConnType {
	case types.AppTunnel:
		return nil, false, trace.ConnectionProblem(err, NoApplicationTunnel)
	case types.DatabaseTunnel:
		return nil, false, trace.ConnectionProblem(err, NoDatabaseTunnel)
Contributor

nit: can we somehow unify this switch so that remoteSite and localSite share the same common logic? https://github.com/gravitational/teleport/blob/roman/leafdb/lib/reversetunnel/localsite.go#L327

Collaborator Author

We could, but to be honest it doesn't feel like moving it out is really worth it (and it would probably complicate the implementation a bit, too).
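For the record, one shape the suggested unification could take - a hedged sketch only, with a hypothetical helper name, assuming it would sit next to the existing error constants in the reversetunnel package:

// errNoTunnelForConnType is a hypothetical shared helper that localSite and
// remoteSite could both call instead of duplicating the switch.
func errNoTunnelForConnType(err error, connType types.TunnelType) error {
	switch connType {
	case types.AppTunnel:
		return trace.ConnectionProblem(err, NoApplicationTunnel)
	case types.DatabaseTunnel:
		return trace.ConnectionProblem(err, NoDatabaseTunnel)
	default:
		return trace.Wrap(err)
	}
}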

	// mu protects the shuffleFunc global access.
	mu sync.RWMutex
	// shuffleFunc provides shuffle behavior for multiple database agents.
	shuffleFunc ShuffleFunc = ShuffleRandom
Contributor

I'm not a big fan of global variables, but I assume this logic is hard to test without it. Though I wonder if the order of the db servers can be enforced in integration tests by setting a specific point in time in the clock used by the proxy db service:

clock := NewFakeClockAt(time.Date(2021, time.April, 4, 0, 0, 0, 0, time.UTC))
pack := setupDatabaseTest(t, withClock(clock))

https://github.com/gravitational/teleport/pull/10641/files#diff-3ed9752f98ccf36539f32e345ea5cb26a13d71552d6b9a9be5117308d26b00cbL119

Collaborator Author

Yeah, I actually thought about the same thing, but didn't know how to ensure a stable shuffle order even with frozen time. I think in that case the shuffle will always produce the same order, but there's no way of telling which order. So while I'm not a big fan of this either, it felt like the most reliable way to guarantee the order.

Contributor

Technically we could always shuffle and in tests add an extra pass to sort them.

Collaborator Author

If you mean adding some sort of "test mode", I was trying to avoid that. I generally try not to make "if isTestMode()" switches in the code.

Contributor

I was about to write NVM, but you were faster. Initially I thought about passing an extra sort function in the test, but that introduces the same issue we already have.
I agree on the "test mode" too.
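For illustration, a minimal sketch of how a test could pin the agent order with this approach. ShuffleFunc and ShuffleRandom appear in the diff above, but the SetShuffleFunc setter, the ShuffleFunc signature, and the db package qualifier are assumptions made for this sketch, not confirmed from the PR.

func TestDatabaseAgentFallback(t *testing.T) {
	// Swap the random shuffle for a deterministic pass-through so the test
	// always dials agents in the order they are listed.
	db.SetShuffleFunc(func(servers []types.DatabaseServer) []types.DatabaseServer {
		return servers
	})
	t.Cleanup(func() { db.SetShuffleFunc(db.ShuffleRandom) })

	// ... set up root/leaf clusters, stop the first agent, and assert the
	// connection falls back to the healthy one.
}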

@smallinsky (Contributor) left a comment

Left some suggestions, but otherwise the PR LGTM.

@r0mant enabled auto-merge (squash) on March 1, 2022 04:55
@r0mant requested a review from jakule on March 1, 2022 18:59
// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
func isReverseTunnelDownError(err error) bool {
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}
Contributor

nit: Any reason not to create a var ErrNoDatabaseTunnel and just use err == ErrNoDatabaseTunnel here instead of comparing strings? To me it sounds cleaner and safer, as the current implementation panics when err == nil - not very likely to happen, but still.

Collaborator Author

From what I looked at, we just get a generic trace error from reversetunnel/transport in this case, so I'm not sure introducing another error type is going to help. Otherwise, the check for trace.ConnectionProblem that we already had here would have worked too.
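As an aside on the panic concern above, a nil guard would be a small addition - a sketch only, not part of this PR:

// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
// The nil check guards against the panic mentioned in the review comment.
func isReverseTunnelDownError(err error) bool {
	if err == nil {
		return false
	}
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}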

-		if trace.IsConnectionProblem(err) {
-			s.log.WithError(err).Warnf("Failed to dial %v.", server)
+		// If an agent is down, we'll retry on the next one (if available).
+		if isReverseTunnelDownError(err) {
Contributor

Why do we need to check for error type anyway? Shouldn't we just try to connect to all servers in case of any error?

Collaborator Author

Yeah, I thought about it originally and decided to only retry reverse tunnel errors specifically, to be on the safe side and not retry errors that should not be retried (like target database connection errors, RBAC, etc.). We do it the same way for kube and apps.


@zmb3 (Collaborator) left a comment

Bot

@r0mant merged commit a480259 into master on Mar 2, 2022
@r0mant deleted the roman/leafdb branch on March 2, 2022 02:33
@gabrielcorado (Contributor)

@r0mant After some tests, this issue also happens in the application agent HA. I've created a PR with a fix similar to the one implemented here and some tests to cover this use case.

Labels
database-access Database access related issues and PRs
Development

Successfully merging this pull request may close these issues:
Database connections don't correctly fallback to healthy agent in leaf clusters (#10640)