Improve HA behavior of database agents in leaf clusters #10641
```diff
@@ -24,6 +24,9 @@ import (
 	"io"
 	"math/rand"
 	"net"
+	"sort"
+	"strings"
+	"sync"
 	"time"
 
 	"github.com/gravitational/teleport"
```
```diff
@@ -85,12 +88,51 @@ type ProxyServerConfig struct {
 	Clock clockwork.Clock
 	// ServerID is the ID of the audit log server.
 	ServerID string
-	// Shuffle allows to override shuffle logic in tests.
-	Shuffle func([]types.DatabaseServer) []types.DatabaseServer
 	// LockWatcher is a lock watcher.
 	LockWatcher *services.LockWatcher
 }
 
+// ShuffleFunc defines a function that shuffles a list of database servers.
+type ShuffleFunc func([]types.DatabaseServer) []types.DatabaseServer
+
+// ShuffleRandom is a ShuffleFunc that randomizes the order of database servers.
+// Used to provide load balancing behavior when proxying to multiple agents.
+func ShuffleRandom(servers []types.DatabaseServer) []types.DatabaseServer {
+	rand.New(rand.NewSource(time.Now().UnixNano())).Shuffle(
+		len(servers), func(i, j int) {
+			servers[i], servers[j] = servers[j], servers[i]
+		})
+	return servers
+}
+
+// ShuffleSort is a ShuffleFunc that sorts database servers by name and host ID.
+// Used to provide predictable behavior in tests.
+func ShuffleSort(servers []types.DatabaseServer) []types.DatabaseServer {
+	sort.Sort(types.DatabaseServers(servers))
+	return servers
+}
+
+var (
+	// mu protects the shuffleFunc global access.
+	mu sync.RWMutex
+	// shuffleFunc provides shuffle behavior for multiple database agents.
+	shuffleFunc ShuffleFunc = ShuffleRandom
+)
+
+// SetShuffleFunc sets the shuffle behavior when proxying to multiple agents.
+func SetShuffleFunc(fn ShuffleFunc) {
+	mu.Lock()
+	defer mu.Unlock()
+	shuffleFunc = fn
+}
+
+// getShuffleFunc returns the configured function used to shuffle agents.
+func getShuffleFunc() ShuffleFunc {
+	mu.RLock()
+	defer mu.RUnlock()
+	return shuffleFunc
+}
+
 // CheckAndSetDefaults validates the config and sets default values.
 func (c *ProxyServerConfig) CheckAndSetDefaults() error {
 	if c.AccessPoint == nil {
```

Review thread on the `shuffleFunc` global variable:

Reviewer: I'm not a big fan of global variables, but I assume this logic is hard to test without one. Though I wonder if the database servers' order could be enforced in integration tests by setting a specific time point in the clock used by the proxy db service.

Author: Yeah, I thought about the same thing, but didn't know how to ensure a stable shuffle order even with a frozen time. I think in that case the shuffle will always produce the same order, but there's no way of telling which order. So while I'm not a big fan of this either, it felt like the most reliable way to guarantee the order.

Reviewer: Technically we could always shuffle, and in tests add an extra pass to sort them.

Author: If you mean adding some sort of "test mode", I was trying to avoid that. I generally try not to put "if isTestMode()" switches in the code.

Reviewer: I was about to write NVM, but you were faster. Initially I thought about passing an extra sort function in the test, but that introduces the same issue we already have.
```diff
@@ -114,15 +156,6 @@ func (c *ProxyServerConfig) CheckAndSetDefaults() error {
 	if c.ServerID == "" {
 		return trace.BadParameter("missing ServerID")
 	}
-	if c.Shuffle == nil {
-		c.Shuffle = func(servers []types.DatabaseServer) []types.DatabaseServer {
-			rand.New(rand.NewSource(c.Clock.Now().UnixNano())).Shuffle(
-				len(servers), func(i, j int) {
-					servers[i], servers[j] = servers[j], servers[i]
-				})
-			return servers
-		}
-	}
 	if c.LockWatcher == nil {
 		return trace.BadParameter("missing LockWatcher")
 	}
```
```diff
@@ -351,7 +384,7 @@ func (s *ProxyServer) Connect(ctx context.Context, proxyCtx *common.ProxyContext
 	// There may be multiple database servers proxying the same database. If
 	// we get a connection problem error trying to dial one of them, likely
 	// the database server is down so try the next one.
-	for _, server := range s.cfg.Shuffle(proxyCtx.Servers) {
+	for _, server := range getShuffleFunc()(proxyCtx.Servers) {
 		s.log.Debugf("Dialing to %v.", server)
 		tlsConfig, err := s.getConfigForServer(ctx, proxyCtx.Identity, server)
 		if err != nil {
```
```diff
@@ -364,9 +397,9 @@ func (s *ProxyServer) Connect(ctx context.Context, proxyCtx *common.ProxyContext
 			ConnType: types.DatabaseTunnel,
 		})
 		if err != nil {
-			// Connection problem indicates reverse tunnel to this server is down.
-			if trace.IsConnectionProblem(err) {
-				s.log.WithError(err).Warnf("Failed to dial %v.", server)
+			// If an agent is down, we'll retry on the next one (if available).
+			if isReverseTunnelDownError(err) {
+				s.log.WithError(err).Warnf("Failed to dial database %v.", server)
 				continue
 			}
 			return nil, trace.Wrap(err)
```

Review thread on the error check:

Reviewer: Why do we need to check for the error type anyway? Shouldn't we just try to connect to all servers in case of any error?

Author: Yeah, I thought about that originally and decided to only retry reverse tunnel errors specifically, to be on the safe side and not retry errors that should not be retried (like target database connection errors, RBAC, etc.). We do it the same way for kube and apps.
@@ -380,6 +413,13 @@ func (s *ProxyServer) Connect(ctx context.Context, proxyCtx *common.ProxyContext | |
return nil, trace.BadParameter("failed to connect to any of the database servers") | ||
} | ||
|
||
// isReverseTunnelDownError returns true if the provided error indicates that | ||
// the reverse tunnel connection is down e.g. because the agent is down. | ||
func isReverseTunnelDownError(err error) bool { | ||
return trace.IsConnectionProblem(err) || | ||
strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After adding missing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think it's correct - previously, we would also return trace.ConnectionProblem there (if you look a few lines below, after directDial) and it wasn't detected in the db proxy. To confirm I also tried removing the error message matching and the integration test I wrote failed with the error this PR fixes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: Any reason why not to create an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As far as I looked at it, we just get a generic trace error from reversetunnel/transport in this case so I'm not sure introducing another error type is going to help. Otherwise, the check for trace.ConnectionProblem which we already had here would have worked too. |
||
} | ||
|
||
// Proxy starts proxying all traffic received from database client between | ||
// this proxy and Teleport database service over reverse tunnel. | ||
// | ||
|
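To pin down what the matcher treats as retryable, a table-driven test along these lines would cover both branches. This is a sketch assuming testify's require package; only isReverseTunnelDownError and reversetunnel.NoDatabaseTunnel come from the diff itself:

```go
func TestIsReverseTunnelDownError(t *testing.T) {
	tests := []struct {
		name string
		err  error
		want bool
	}{
		// Matched by error type.
		{"connection problem", trace.ConnectionProblem(nil, "tunnel is down"), true},
		// Matched by the known message substring from the reversetunnel package.
		{"no database tunnel", trace.Errorf("%s", reversetunnel.NoDatabaseTunnel), true},
		// Anything else (RBAC, database errors, ...) must not be retried.
		{"access denied", trace.AccessDenied("access denied"), false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			require.Equal(t, tt.want, isReverseTunnelDownError(tt.err))
		})
	}
}
```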
Review thread on the dial logic in remoteSite:

Reviewer: nit: Can we somehow unify this switch so we have the same common logic for remoteSite and localSite? https://github.com/gravitational/teleport/blob/roman/leafdb/lib/reversetunnel/localsite.go#L327

Author: We could, but to be honest it doesn't feel like moving it out is really worth it (and it would probably complicate the implementation a bit too).
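The switch in question is not shown in this diff, so purely as an illustration of the suggestion: the unification could amount to a small helper inside the reversetunnel package that both site implementations call when a per-connection-type dial fails. Everything below is hypothetical:

```go
// dialTunnelErr is a hypothetical shared helper: localSite and remoteSite
// would both build their "tunnel is down" errors through it, keeping the
// error shape consistent across local and remote clusters.
func dialTunnelErr(connType types.TunnelType, clusterName string, err error) error {
	switch connType {
	case types.DatabaseTunnel:
		return trace.ConnectionProblem(err, NoDatabaseTunnel)
	default:
		return trace.ConnectionProblem(err, "no tunnel connection to cluster %v", clusterName)
	}
}
```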