
Improve HA behavior of database agents in leaf clusters #10641

Merged 3 commits into master from roman/leafdb on Mar 2, 2022

Conversation

@r0mant (Collaborator) commented Feb 25, 2022

Fixes #10640.

The database agent fallback logic in the proxy server did not properly recognize the reverse tunnel error when connecting to a leaf cluster agent, so the retry logic did not kick in. The reason is that when connecting to an agent in the same cluster the returned error is a trace.ConnectionProblem, but when connecting to an agent in a leaf cluster it's just a generic error. This fix makes sure we also check the error message to recognize it.

I've also written integration tests for both scenarios, HA in root and HA in leaf. I had to do some shenanigans to make the shuffle behavior overridable in tests so we don't rely on chance.
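For context, the fallback path on the proxy side looks roughly like the sketch below. This is a simplified illustration, not the verbatim Teleport code: dialAgent and getShuffleFunc are hypothetical stand-ins for the real dialing and shuffle plumbing, while isReverseTunnelDownError is the helper discussed in this PR.

// Simplified sketch: try database agents in shuffled order and fall through
// to the next one only when the reverse tunnel to the current agent is down.
func connectWithFallback(ctx context.Context, servers []types.DatabaseServer) (net.Conn, error) {
	for _, server := range getShuffleFunc()(servers) { // hypothetical shuffle accessor
		conn, err := dialAgent(ctx, server) // hypothetical dial helper
		if err != nil {
			// If an agent is down, retry on the next one (if available).
			if isReverseTunnelDownError(err) {
				continue
			}
			// Other errors (RBAC, target database failures, etc.) are not retried.
			return nil, trace.Wrap(err)
		}
		return conn, nil
	}
	return nil, trace.NotFound("no database agents available")
}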

@github-actions bot added the database-access (Database access related issues and PRs) label on Feb 25, 2022
@r0mant (Collaborator, author) commented Feb 25, 2022

@gabrielcorado Can you also check whether the HA for application access we implemented a couple of months ago is prone to the same issue and, if so, implement a fix/test similar to this one?

// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
func isReverseTunnelDownError(err error) bool {
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}
Contributor

After adding the missing case types.DatabaseTunnel to the switch r.ConnType and returning trace.ConnectionProblem(err, ...), the check based on the NoDatabaseTunnel error message text seems to be unnecessary.

Collaborator Author

I don't think it's correct - previously, we would also return trace.ConnectionProblem there (if you look a few lines below, after directDial) and it wasn't detected in the db proxy. To confirm I also tried removing the error message matching and the integration test I wrote failed with the error this PR fixes.
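To make the failure mode concrete, here is a small standalone sketch (not code from this PR; the message constant below is made up for illustration, the real text lives in the reversetunnel package): the typed check matches errors produced in the same cluster, but an error that traveled back from a leaf cluster arrives as a generic error, so only its message can be matched.

package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/gravitational/trace"
)

// noDatabaseTunnel stands in for reversetunnel.NoDatabaseTunnel.
const noDatabaseTunnel = "no database reverse tunnel found"

func main() {
	// Same-cluster case: the dial failure is a typed trace.ConnectionProblem.
	localErr := trace.ConnectionProblem(nil, noDatabaseTunnel)
	fmt.Println(trace.IsConnectionProblem(localErr)) // true

	// Leaf-cluster case: the failure comes back over the transport re-created
	// as a generic error - the type is lost, only the message text remains.
	remoteErr := errors.New("failed dialing leaf agent: " + noDatabaseTunnel)
	fmt.Println(trace.IsConnectionProblem(remoteErr))                  // false
	fmt.Println(strings.Contains(remoteErr.Error(), noDatabaseTunnel)) // true
}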

Comment on lines +378 to +384
	// Connections to applications and databases should never occur over
	// a direct dial, return right away.
	switch r.ConnType {
	case types.AppTunnel:
		return nil, false, trace.ConnectionProblem(err, NoApplicationTunnel)
	case types.DatabaseTunnel:
		return nil, false, trace.ConnectionProblem(err, NoDatabaseTunnel)
Contributor

nit: can we somehow unify this switch so that remoteSite and localSite share the same common logic? https://github.com/gravitational/teleport/blob/roman/leafdb/lib/reversetunnel/localsite.go#L327

Collaborator Author

We could, but to be honest it doesn't feel like moving it out is really worth it (and it would probably complicate the implementation a bit, too).
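For the record, one shape the suggested unification could take - a hedged sketch only, with a hypothetical helper name, assuming it would sit next to the existing error constants in the reversetunnel package:

// errNoTunnelForConnType is a hypothetical shared helper that localSite and
// remoteSite could both call instead of duplicating the switch.
func errNoTunnelForConnType(err error, connType types.TunnelType) error {
	switch connType {
	case types.AppTunnel:
		return trace.ConnectionProblem(err, NoApplicationTunnel)
	case types.DatabaseTunnel:
		return trace.ConnectionProblem(err, NoDatabaseTunnel)
	default:
		return trace.Wrap(err)
	}
}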

	// mu protects the shuffleFunc global access.
	mu sync.RWMutex
	// shuffleFunc provides shuffle behavior for multiple database agents.
	shuffleFunc ShuffleFunc = ShuffleRandom
Contributor

I'm not a big fan of global variables, but I assume this logic is hard to test without it. Though I wonder if the order of the db servers can be enforced in integration tests by setting a specific point in time in the clock used by the proxy db service:

clock := NewFakeClockAt(time.Date(2021, time.April, 4, 0, 0, 0, 0, time.UTC))
pack := setupDatabaseTest(t, withClock(clock))

https://github.com/gravitational/teleport/pull/10641/files#diff-3ed9752f98ccf36539f32e345ea5cb26a13d71552d6b9a9be5117308d26b00cbL119

Collaborator Author

Yeah, I actually thought about the same thing, but didn't know how to ensure a stable shuffle order even with frozen time. I think in that case the shuffle will always produce the same order, but there's no way of telling which order. So while I'm not a big fan of this either, it felt like the most reliable way to guarantee the order.

Contributor

Technically we could always shuffle and in tests add an extra pass to sort them.

Collaborator Author

If you mean adding some sort of "test mode", I was trying to avoid that. I generally try not to make "if isTestMode()" switches in the code.

Contributor

I was about to write NVM, but you were faster. Initially I thought about passing an extra sort function in the test, but that introduces the same issue we already have.
I agree on the "test mode" too.
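For illustration, a minimal sketch of how a test could pin the agent order with this approach. ShuffleFunc and ShuffleRandom appear in the diff above, but the SetShuffleFunc setter, the ShuffleFunc signature, and the db package qualifier are assumptions made for this sketch, not confirmed from the PR.

func TestDatabaseAgentFallback(t *testing.T) {
	// Swap the random shuffle for a deterministic pass-through so the test
	// always dials agents in the order they are listed.
	db.SetShuffleFunc(func(servers []types.DatabaseServer) []types.DatabaseServer {
		return servers
	})
	t.Cleanup(func() { db.SetShuffleFunc(db.ShuffleRandom) })

	// ... set up root/leaf clusters, stop the first agent, and assert the
	// connection falls back to the healthy one.
}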

@smallinsky (Contributor) left a comment

Left some suggestions, but otherwise the PR LGTM.

@r0mant enabled auto-merge (squash) on March 1, 2022 04:55
@r0mant requested a review from jakule on March 1, 2022 18:59
// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
func isReverseTunnelDownError(err error) bool {
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}
Contributor

nit: Any reason not to create a var ErrNoDatabaseTunnel and just use err == ErrNoDatabaseTunnel here instead of comparing strings? To me it sounds cleaner and safer, as the current implementation panics when err == nil - not very likely to happen, but still.

Collaborator Author

From what I looked at, we just get a generic trace error from reversetunnel/transport in this case, so I'm not sure introducing another error type is going to help. Otherwise, the check for trace.ConnectionProblem that we already had here would have worked too.
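As an aside on the panic concern above, a nil guard would be a small addition - a sketch only, not part of this PR:

// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
// The nil check guards against the panic mentioned in the review comment.
func isReverseTunnelDownError(err error) bool {
	if err == nil {
		return false
	}
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}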

-		if trace.IsConnectionProblem(err) {
-			s.log.WithError(err).Warnf("Failed to dial %v.", server)
+		// If an agent is down, we'll retry on the next one (if available).
+		if isReverseTunnelDownError(err) {
Contributor

Why do we need to check for error type anyway? Shouldn't we just try to connect to all servers in case of any error?

Collaborator Author

Yeah, I thought about it originally and decided to only retry reverse tunnel errors specifically, to be on the safe side and not retry errors that should not be retried (like target database connection errors, RBAC, etc.). We do it the same way for kube and apps.


@zmb3 (Collaborator) left a comment

Bot

@r0mant merged commit a480259 into master on Mar 2, 2022
@r0mant deleted the roman/leafdb branch on March 2, 2022 02:33
@gabrielcorado (Contributor)

@r0mant After some tests, this issue also happens in the application agent HA. I've created a PR with a fix similar to the one implemented here and some tests to cover this use case.

Labels
database-access Database access related issues and PRs
Development

Successfully merging this pull request may close these issues:
Database connections don't correctly fallback to healthy agent in leaf clusters (#10640)