Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read timed-out commands in the background instead of closing conns. #1588

Conversation

matthewdale
Copy link
Collaborator

@matthewdale matthewdale commented Apr 1, 2024

DO NOT MERGE!

Summary

Experimental change to minimize connection churn during lots-of-timeout events caused by increased database latency.

Fixes:

  • Fix a bug that prevents maxTimeMS from being added to commands when a context with deadline is used on an operation, even if timeoutMS is set on the client.
    • Omit maxTimeMS from operations that create a cursor (Find, Aggregate) because the default behavior of maxTimeMS is to limit the lifetime of the cursor as well as the initial command, which may be surprising. FindOne operations still send maxTimeMS.
  • Use minimum round-trip time instead of 90th percentile round-trip time to calculate maxTimeMS

Experimental:

  • When an operation timeout occurs, mark the in-use connection as "awaiting response" and start a new conn read in a background goroutine. Wait for up to 1 second to get a reply from the server. If the conn read is successful, check the connection into the pool to be reused. If the conn read fails, close the connection.

Based on the v1.14.0 release.

Background & Motivation

The Go Driver currently closes the in-use connection when a command times out on the client side. MongoDB processes commands sent on a single connection sequentially, so sending another command on the same connection after the previous one timed out could lead to the subsequent command taking an unpredictable amount of time to complete. Closing the connection is the simplest way to both resolve the unknown connection state and signal MongoDB to stop working on the command, but can cause problems when connection are closed and created at a very high rate. We've observed that setting maxTimeMS as described in the CSOT spec can cause MongoDB to reliably respond with a server-side timeout error 25-100ms after a client-side timeout occurs. That consistent response timing makes it feasible to await the response in the background after a client-side timeout.

Copy link

mongodb-drivers-pr-bot bot commented Apr 1, 2024

API Change Report

./x/experiment

compatible changes

package added

./x/mongo/driver

incompatible changes

##ExtractErrorFromServerResponse: changed from func(./x/bsonx/bsoncore.Document) error to func(./x/bsonx/bsoncore.Document, bool) error

./x/mongo/driver/topology

compatible changes

BGCallback: added
ConnectionError.Timeout: added

}
// fmt.Println("CHECKIN: Checking in bg conn, waited", time.Since(start))
addBGDuration(p.address.String(), time.Since(start))
err = p.checkIn(conn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the connection needs to be checked back into the pool before the background task completes the read. Otherwise, two sequential operations can lead to two connections being created. An alternative design is to not use background tasks at all and instead complete the read the next time the connection is checked out and used. That's what I did in the Python poc: ShaneHarvey/mongo-python-driver@2de8d17

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both approaches should accomplish the same goal, but may have somewhat different operation duration impact. I'd expect both to have about the same amortized impact to operation duration and throughput, although the distribution of operation durations might be different.

The background read case is easier to implement quickly in the Go driver, so I went with that approach for now. In the Go driver, we can't check the connection in until the background goroutine is done reading it because the topology.connection type (and the underlying net.Conn) are not goroutine-safe. If another goroutine checked out the connection and tried to use it, that would lead to undefined behavior.

I agree that in this case, sequential operations may create multiple connections. Does that create a problem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sequential operations may create multiple connections. Does that create a problem?

Yes this is a problem because it can lead to a connection storm which is exactly what we're trying to prevent.

@matthewdale
Copy link
Collaborator Author

Closing in favor of the non-experimental PR #1589.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants