Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GODRIVER-1898 SDAM error handling changes for LB mode #611

Merged
merged 3 commits into from
Mar 24, 2021

Conversation

divjotarora
Copy link
Contributor

@divjotarora divjotarora commented Mar 15, 2021

Summary of changes:

  1. Instead of tracking a pool's generation number as a single uint64, use a map from serverID (which is an ObjectID) to a tuple of (generation number, connection count). In this new map type, primitive.NilObjectID is used to store the generation and connection count for servers that do not return a serverId field in ismaster responses (i.e. servers in non-load balanced deployments).
  2. When clearing the pool, only clear connections for a specific server ID.
  3. Add a ServerID field to PoolCleared events.
  4. Ignore connection establishment errors that occur before the server ID is known for connections to LBs.

Copy link
Contributor

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Frankly there are some code changes here that really go over my head, but I think that's only due to my poor understanding of connection pooling. I'll probably ask some questions offline.

x/mongo/driver/topology/pool.go Show resolved Hide resolved
Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done, the thoughtful comments and description were really helpful. I have a few questions to check my own understanding, but code LGTM.

// Ignore the error if the server is behind a load balancer but the server ID is unknown. This indicates that the
// error happened when dialing the connection or during the MongoDB handshake, so we don't know the server ID to use
// for clearing the pool.
if err == nil || s.cfg.loadBalanced && serverID == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, it should be impossible for serverID to be primitive.NilObjectID when in load balanced mode. In load balanced mode, any handshake that omits a serverID in the reply will result in an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there's a distinction between nil and primitive.NilObjectID. The former is a nil pointer and the latter is a non-nil pointer that contains 12 0 bytes. In LB mode, the former is possible when we get here if we encountered an error before the initial hello succeeded. The latter is never possible because 12 0 bytes is not a valid ObjectId (at least not right now because the timestamp field would be non-zero).

x/mongo/driver/topology/connection.go Outdated Show resolved Hide resolved
// setGenerationNumber sets the connection's generation number if a callback has been provided to do so in connection
// configuration.
func (c *connection) setGenerationNumber() {
if c.config.getGenerationFn != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would getGenerationFn ever be nil? The serverID => (generation, count) map is used to determine generation regardless of whether we are in load balanced mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For unit tests, it can be. I verified that removing the line in pool.go that sets this function causes some of the new spec tests to fail though.

@@ -288,7 +294,7 @@ func (s *Server) ProcessHandshakeError(err error, startingGenerationNumber uint6
// the description.Server appropriately. The description should not have a TopologyVersion because the staleness
// checking logic above has already determined that this description is not stale.
s.updateDescription(description.NewServerFromError(s.address, wrappedConnErr, nil))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check, would updating the server description to unknown make it ineligible for server selection (and similarly in ProcessError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec actually prohibits us from marking the server Unknown to ensure that the LB is always selectable. As part of a previous PR in this epic, I changed updateDescription to be a no-op when the server type is LoadBalancer. I felt this was a cleaner approach than checking every invocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, that makes sense. Thanks for the explanations!

Co-authored-by: Kevin Albertson <kevin.albertson@10gen.com>
@divjotarora divjotarora merged commit caeb478 into mongodb:master Mar 24, 2021
@divjotarora divjotarora deleted the godriver1898-lb-sdam-errors branch March 24, 2021 19:59
faem pushed a commit to kubedb/mongo-go-driver that referenced this pull request Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants