
database/sql: reuses expired connections with strategy= alwaysNewConn #32530

Closed
gwax opened this issue Jun 10, 2019 · 20 comments

@gwax commented Jun 10, 2019

What version of Go are you using (go version)?

$ go version
go version go1.11.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/georgeleslie-waksman/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/georgeleslie-waksman/co/backend/go"
GOPROXY=""
GORACE=""
GOROOT="/Users/georgeleslie-waksman/co/backend/opt/go1.11.5"
GOTMPDIR=""
GOTOOLDIR="/Users/georgeleslie-waksman/co/backend/opt/go1.11.5/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/41/19kt662x72n9nfj8hzb8t1q00000gq/T/go-build118274996=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

When connections are released, they may bypass the pool and be used to fulfill requests made with strategy=alwaysNewConn, even if they are "expired" according to SetConnMaxLifetime. This allows connections to fail on BeginTx with ErrBadConn despite passing through putConn with err == nil.

The following testcase (for database/sql_test.go) exhibits the failure case:

func TestPutConnExpiredOrBadReset(t *testing.T) {
	execCases := []struct {
		expired bool
		badReset bool
	}{
		{false, false},
		{true, false},
		{false, true},
	}

	t0 := time.Unix(1000000, 0)
	offset := time.Duration(0)

	nowFunc = func() time.Time { return t0.Add(offset) }
	defer func() { nowFunc = time.Now }()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	db := newTestDB(t, "magicquery")
	defer closeDB(t, db)

	db.SetMaxOpenConns(1)

	for _, ec := range execCases {
		ec := ec
		name := fmt.Sprintf("expired=%t,badReset=%t", ec.expired, ec.badReset)
		t.Run(name, func(t *testing.T) {
			db.clearAllConns(t)

			db.SetMaxIdleConns(1)
			db.SetConnMaxLifetime(10 * time.Second)

			conn, err := db.conn(ctx, alwaysNewConn)
			if err != nil {
				t.Fatal(err)
			}

			var wg sync.WaitGroup
			wg.Add(1)
			go func() {
				defer wg.Done()
				conn, err := db.conn(ctx, alwaysNewConn)
				if err != nil {
					// t.Fatal must not be called from a non-test goroutine.
					t.Error(err)
					return
				}
				db.putConn(conn, err, false)
			}()

			// Wait for pending request
			for len(db.connRequests) < 1 {}

			if ec.expired {
				offset = 11 * time.Second
			} else {
				offset = time.Duration(0)
			}

			if ec.badReset {
				conn.ci.(*fakeConn).stickyBad = true
			}
			db.putConn(conn, err, true)

			wg.Wait()
		})
	}
}

with output:

--- FAIL: TestPutConnExpiredOrBadReset (0.00s)
    --- FAIL: TestPutConnExpiredOrBadReset/expired=true,badReset=false (0.00s)
        sql_test.go:2226: driver: bad connection
    --- FAIL: TestPutConnExpiredOrBadReset/expired=false,badReset=true (0.00s)
        sql_test.go:2226: driver: bad connection
FAIL
FAIL	_/Users/georgeleslie-waksman/co/go/src/database/sql	1.337s

What did you expect to see?

BeginTx's third attempt with strategy=alwaysNewConn should not reuse an existing connection and should definitely not reuse an expired connection.

What did you see instead?

BeginTx's third attempt fails with ErrBadConn due to connection expiration.

@gwax (Author) commented Jun 10, 2019

To the best of my understanding, we can resolve the issue by modifying putConnDBLocked so that only new connections may be used directly to fulfill requests; reused connections must pass through the idle pool.

@gwax (Author) commented Jun 10, 2019

This issue also occurs and is seriously exacerbated if the session resetter channel db.resetterCh becomes blocked, in which case connections will be released back with err = driver.ErrBadConn and sent for reuse.

@ianlancetaylor (Contributor) commented Jun 10, 2019

@gwax (Author) commented Jun 12, 2019

I have updated the issue with a more minimal failure case that can be dropped directly into database/sql_test.go.

@gwax (Author) commented Jun 17, 2019

I have put together a fix and submitted a PR: #32656

@gopherbot commented Jun 17, 2019

Change https://golang.org/cl/182599 mentions this issue: database/sql: prevent conn reuse on new conn request

@kardianos (Contributor) commented Aug 1, 2019

@gwax Let's ignore the session resetter issue for now, I will resolve that in the next release.

Can you explain when you encounter this issue, and what it looks like for your application when you do? Thanks.

@gwax (Author) commented Aug 1, 2019

In our application, with SetConnMaxLifetime, we experience a small but steady rate of driver.ErrBadConn being returned by DB.BeginTx.

In debugging and tracing the issue, we found the code path is as follows:

  1. DB.BeginTx tries twice with strategy == cachedOrNewConn, getting driver.ErrBadConn both times: https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1676
  2. DB.BeginTx calls DB.begin with strategy == alwaysNewConn: https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1682
  3. DB.begin calls DB.conn with strategy == alwaysNewConn: https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1694
  4. We have the maximum number of connections open so we push a request to db.connRequests: https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1176
  5. (thread2) Concurrently with this, another thread finishes execution with a connection that has gone (or is just about to go) over its lifetime and attempts to put the connection back: https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1322
  6. (thread2) The released connection is then used to satisfy the pending request (reused connection for a new alwaysNewConn request): https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1371
  7. DB.conn checks if the connection has expired (it has) and we return driver.ErrBadConn: https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1207

Adding an expiry check to the putConn codepath decreased the rate of errors (smaller concurrency race window) but did not eliminate them.

Enabling the session resetter hits the same race condition but spikes our error rate. In this case, the resetter was overloaded and dc.lastErr was being set concurrently with the connection release (step 6): https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1339

@gwax (Author) commented Aug 1, 2019

The PR that I have submitted is fairly large and is, perhaps, clearer to understand as two separate changes (I can resubmit as two PRs, if that makes it easier).

The first change is quite large, and refactors db.freeConn from []*driverConn to chan *driverConn. This change is necessary so that while we are at max connections and waiting for a request to be filled, we can use a select statement to unblock on either a direct request or on a connection going to the pool: 6baa77b

The second change is quite small and makes sure that reused sessions always go to the connection pool, not connRequests: 9816712

@kardianos (Contributor) commented Aug 6, 2019

@gwax Thank you for the bullet point description of events that lead up to this.

Don't resend the PR yet. There is actually a deeper issue I would like to address here I think. The propensity for this connection pool implementation to retain expired connections is a fundamental design issue.

Any change we do won't make it into the Go1.13 release (this is on me, I'm sorry). So we are targeting Go1.14 here.

Fundamentally, if a driver connection is expired then it should be closed and a new one should be returned, "atomically" within conn. If we factor out the code that opens a new connection at the end of db.conn, then allow it to be called from either the expired or the normal case, that seems more optimal than worrying about how the connections are represented in a queue (and would touch less code).

What do you think of that approach?

@gwax (Author) commented Aug 11, 2019

I am happy to live with a Go1.14 target and patch our deployed version of the code. I far prefer a patch that will eventually be in mainline so that we can timebox deploying a fork on our end.

Ultimately, I am in favor of any approach that fixes this bug, with a preference for solutions that minimize race conditions and complexity, while maximizing readability. If a small fix is possible, I will always prefer changing fewer lines of code over any other preferences.

Ultimately, I don't think fully avoiding expired/broken connections in the pool is possible. A connection could expire at the very moment it is pulled from the pool, no matter how aggressively we clean it. Also, aggressive cleaning comes at a cost.

I do, in principle, like the idea, which I think you're proposing, that we could close an expired connection and open a new one all within the same conn() request (presumably with the relevant locks held to ensure connection limits and concurrent safety).

That said, I think we're still likely to run into race conditions and queuing fairness issues if we continue to use a slice for the connection pool and channels for new requests. Every other, smaller, attempt I made to fix this issue ran afoul of something.

From a readability and maintainability standpoint, I would really prefer if strategy = alwaysNewConn always meant that a connection was not being reused.

My PR was written with the goal of having the slice -> buffered chan connection pool change be the only architectural change. I could take another pass and see if my changes could be simplified or made to touch fewer places.

In the end, I think we want a conn() method of the form:

db.connRequests <- something
select {
case <-ctx.Done():
  // bail
case conn, ok := <-db.connectionPool:
  // if not ok return db closed
  // if good connection and newOrCached: return
  // else close, open and return a new one
case conn, ok := <-db.filledRequests:
  // if not ok return db closed
  // return conn
}

If we take a broader approach, I would love to see db.connRequests become FIFO instead of depending on whatever is the first value to come when iterating through the map via range. I am also happy to leave that as outside the scope of this issue and either file another, or disregard.

If there's a plan, or architecture, that you're ok with, I have some time and energy available for implementation.

@gwax (Author) commented Nov 11, 2019

@kardianos this issue still exists and is still causing errors for us. Could you provide some attention or direction to this?

In the hopes of trying to get a fix into the standard library, I have reorganized and curated the commits in my earlier PR: #32656

The commits now reflect three separate steps in the fixing process:

  • 13d516d - a refactor of the code to convert the free connection pool from a slice to a buffered channel (this is necessary to properly handle concurrent pool access).
  • 08f679e - a failing test
  • 93f9037 - a fix to prevent old sessions being used to fill new requests

If there is more that I can do or another approach that you think is better, please provide me with some direction. I am happy to devote time and energy to creating a fix, as long as my fix has a chance of landing.

@gopherbot commented Jan 24, 2020

Change https://golang.org/cl/216197 mentions this issue: database/sql: do not return expired conn when new conn requested

@kardianos (Contributor) commented Jan 24, 2020

@gwax Please try out this CL: https://go-review.googlesource.com/c/go/+/216197

You'll need both CLs in the series, and you should implement the ConnectionDiscarder interface on the driver. But it should work.

@kardianos kardianos added this to the Go1.15 milestone Apr 20, 2020
@kardianos (Contributor) commented Apr 20, 2020

CL Ready. Waiting for a reviewer.

@gopherbot commented Jul 11, 2020

Change https://golang.org/cl/242101 mentions this issue: [release-branch.go1.14] database/sql: backport 3 Tx rollback related CLs

@gopherbot commented Jul 11, 2020

Change https://golang.org/cl/242102 mentions this issue: [release-branch.go1.14] database/sql: backport 3 Tx rollback related CLs

@gopherbot commented Jul 15, 2020

Change https://golang.org/cl/242522 mentions this issue: [release-branch.go1.13] database/sql: backport 5 Tx rollback related CLs

gopherbot pushed a commit that referenced this issue Jul 16, 2020
Manually backported the subject CLs, because of lack of
Gerrit "forge-author" permissions, but also because the prior
cherry picks didn't apply cleanly, due to a tight relation chain.

The backport comprises:
* CL 174122
* CL 216197
* CL 223963
* CL 216240
* CL 216241

Note:
Due to the restriction that we cannot retroactively
introduce API changes to Go1.13.13 that weren't in Go1.13, the Conn.Validator
interface (from CL 174122, CL 223963) isn't exposed; drivers are instead
inspected for an IsValid() bool method.

For a description of the content of each CL:

* CL 174122:
database/sql: process all Session Resets synchronously

Adds a new interface, driver.ConnectionValidator, to allow
drivers to signal that they should not be used again,
separately from the session resetter interface.
This is done now that the session reset is done
after the connection is put into the connection pool.

Previous behavior attempted to run Session Resets
in a background worker. This implementation had two
problems: untested performance gains for additional
complexity, and failures when the pool size
exceeded the connection reset channel buffer size.

* CL 216197:
database/sql: check conn expiry when returning to pool, not when handing it out

With the original connection reuse strategy, it was possible that
when a new connection was requested, the pool would wait for
an existing connection to be returned for reuse in a full connection
pool, and then it would check if the returned connection was expired.
If the returned connection expired while awaiting reuse, it would
return an error to the location requesting the new connection.
The call sites requesting a new connection were often the last
attempt at obtaining a connection for a query. This would then
result in a failed query.

This change ensures that we perform the expiry check right
before a connection is inserted back into the connection pool,
rather than while requesting a new connection. Requesting a new
connection will no longer fail due to the connection expiring.

* CL 216240:
database/sql: prevent Tx statement from committing after rollback

It was possible for a Tx that was aborted for rollback
asynchronously to execute a query after the rollback had completed
on the database, which often would auto commit the query outside
of the transaction.

By write-locking tx.closemu prior to issuing the rollback on the
connection, we ensure that any Tx query either fails or finishes
on the Tx, and never runs after the Tx has rolled back.

* CL 216241:
database/sql: on Tx rollback, retain connection if driver can reset session

Previously the Tx would drop the connection after rolling back from
a context cancel. Now if the driver can reset the session,
keep the connection.

* CL 223963
database/sql: add test for Conn.Validator interface

This addresses comments made by Russ after
https://golang.org/cl/174122 was merged. It adds a test
for the connection validator and renames the interface to just
"Validator".

Updates #31480
Updates #32530
Updates #32942
Updates #34775
Fixes #40205

Change-Id: I6d7307180b0db0bf159130d91161764cf0f18b58
Reviewed-on: https://go-review.googlesource.com/c/go/+/242522
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Theophanes <kardianos@gmail.com>
gopherbot pushed a commit that referenced this issue Jul 16, 2020
Manually backported the subject CLs, because of lack of
Gerrit "forge-author" permissions, but also because the prior
cherry picks didn't apply cleanly, due to a tight relation chain.

The backport comprises:
* CL 174122
* CL 216197
* CL 223963
* CL 216240
* CL 216241

Note:
Due to the restriction that we cannot retroactively
introduce API changes to Go1.14.6 that weren't in Go1.14, the Conn.Validator
interface (from CL 174122, CL 223963) isn't exposed; drivers are instead
inspected for an IsValid() bool method.

For a description of the content of each CL:

* CL 174122:
database/sql: process all Session Resets synchronously

Adds a new interface, driver.ConnectionValidator, to allow
drivers to signal that they should not be used again,
separately from the session resetter interface.
This is done now that the session reset is done
after the connection is put into the connection pool.

Previous behavior attempted to run Session Resets
in a background worker. This implementation had two
problems: untested performance gains for additional
complexity, and failures when the pool size
exceeded the connection reset channel buffer size.

* CL 216197:
database/sql: check conn expiry when returning to pool, not when handing it out

With the original connection reuse strategy, it was possible that
when a new connection was requested, the pool would wait for
an existing connection to be returned for reuse in a full connection
pool, and then it would check if the returned connection was expired.
If the returned connection expired while awaiting reuse, it would
return an error to the location requesting the new connection.
The call sites requesting a new connection were often the last
attempt at obtaining a connection for a query. This would then
result in a failed query.

This change ensures that we perform the expiry check right
before a connection is inserted back into the connection pool,
rather than while requesting a new connection. Requesting a new
connection will no longer fail due to the connection expiring.

* CL 216240:
database/sql: prevent Tx statement from committing after rollback

It was possible for a Tx that was aborted for rollback
asynchronously to execute a query after the rollback had completed
on the database, which often would auto commit the query outside
of the transaction.

By write-locking tx.closemu prior to issuing the rollback on the
connection, we ensure that any Tx query either fails or finishes
on the Tx, and never runs after the Tx has rolled back.

* CL 216241:
database/sql: on Tx rollback, retain connection if driver can reset session

Previously the Tx would drop the connection after rolling back from
a context cancel. Now if the driver can reset the session,
keep the connection.

* CL 223963
database/sql: add test for Conn.Validator interface

This addresses comments made by Russ after
https://golang.org/cl/174122 was merged. It adds a test
for the connection validator and renames the interface to just
"Validator".

Updates #31480
Updates #32530
Updates #32942
Updates #34775
Fixes #39101

Change-Id: I043d2d724a367588689fd7d6f3cecb39abeb042c
Reviewed-on: https://go-review.googlesource.com/c/go/+/242102
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Theophanes <kardianos@gmail.com>
@stevenh (Contributor) commented Jul 26, 2020

Thanks for this. We've been chasing this in our apps for a while; we thought it was a driver error, but after adding logging to the driver we saw it was never returning this error, so this could be the cause of our issues too.

@mong0520 commented Jul 30, 2020

Thanks to @gwax and @kardianos for their effort addressing this; it helped a lot in troubleshooting my application.
Can I expect this issue to be resolved in Go 1.15?
