database/sql: DB full resetterCh causes driver.ErrBadConn error #31480

Closed
tuanpavn opened this issue Apr 16, 2019 · 17 comments

@tuanpavn tuanpavn commented Apr 16, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

$ go env
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"

What did you do?

I am stress-testing a MySQL server using Go.
I create a sql.DB and set:

db.SetMaxOpenConns(128)
db.SetMaxIdleConns(32) 

Then I create 500 goroutines (500 clients) and send 1000000 queries to the MySQL server.
When I run the program, it sometimes returns the error "driver: bad connection" (driver.ErrBadConn).

I found that sql.OpenDB creates the *sql.DB struct with:
resetterCh: make(chan *driverConn, 50)

In func (db *DB) putConn(dc *driverConn, err error, resetSession bool), if db.resetterCh is full, the connection is marked as bad:

select {
default:
	// If the resetterCh is blocking then mark the connection
	// as bad and continue on.
	dc.lastErr = driver.ErrBadConn
	dc.Unlock()
case db.resetterCh <- dc:
}

If the number of connections then exceeds the maximum, the pool reuses an old connection that was marked as bad and returns driver.ErrBadConn.
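
The effect of the full channel can be reproduced in isolation. The following is a minimal sketch, not the database/sql source: conn, enqueueForReset, and countMarkedBad are hypothetical names, and it assumes the worst case in which nothing drains the channel while connections are being returned.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for database/sql's driverConn.
type conn struct {
	id  int
	bad bool // models dc.lastErr = driver.ErrBadConn
}

// enqueueForReset models the non-blocking send quoted above: if the
// buffered channel is full, the default branch marks the conn as bad.
func enqueueForReset(ch chan *conn, c *conn) {
	select {
	case ch <- c:
	default:
		c.bad = true
	}
}

// countMarkedBad returns how many of n connections end up marked bad when
// all of them are returned before anything drains a channel of size buf.
func countMarkedBad(n, buf int) int {
	ch := make(chan *conn, buf)
	bad := 0
	for i := 0; i < n; i++ {
		c := &conn{id: i}
		enqueueForReset(ch, c)
		if c.bad {
			bad++
		}
	}
	return bad
}

func main() {
	fmt.Println(countMarkedBad(128, 50)) // 78: the channel holds only 50
	fmt.Println(countMarkedBad(32, 50))  // 0: everything fits
}
```

In the real pool a background goroutine drains the channel, so far fewer connections would be affected; the sketch only shows why a pool larger than the buffer can hit the default branch at all.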

I can work around it by setting the maximum number of connections to less than 50 (the size of db.resetterCh).
Why is the size of db.resetterCh hardcoded to 50?
Should it be set to the maximum number of connections?

https://play.golang.org/p/phUILuRV3hJ

What did you expect to see?

What did you see instead?

@kardianos kardianos self-assigned this Apr 16, 2019
@kardianos kardianos added the NeedsFix label Apr 16, 2019

@n-ozerov n-ozerov commented Apr 26, 2019

experiencing same issue


@gopherbot gopherbot commented Apr 26, 2019

Change https://golang.org/cl/174122 mentions this issue: database/sql: do not mark conn as bad when resetter is full

Contributor

@kardianos kardianos commented Apr 26, 2019

I can see two solutions to this issue:

  • resize the resetter channel when the pool size is > 50
  • process the session reset synchronously when the fixed-size resetter channel is full
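
The second option could look roughly like the sketch below. This is only an illustration of the idea under stated assumptions, not the CL: conn, resetSession, handOff, and countInline are hypothetical names, and the "reset" is just a flag.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for a pooled connection.
type conn struct {
	id    int
	reset bool
}

// resetSession is a hypothetical stand-in for the driver's session reset.
func resetSession(c *conn) { c.reset = true }

// handOff tries the background queue first. When the queue is full it
// resets the session inline instead of marking the connection as bad.
// It reports whether the reset happened inline.
func handOff(ch chan *conn, c *conn) bool {
	select {
	case ch <- c:
		return false
	default:
		resetSession(c)
		return true
	}
}

// countInline returns how many of n connections are reset inline when
// nothing drains a queue of size buf.
func countInline(n, buf int) int {
	ch := make(chan *conn, buf)
	inline := 0
	for i := 0; i < n; i++ {
		if handOff(ch, &conn{id: i}) {
			inline++
		}
	}
	return inline
}

func main() {
	fmt.Println(countInline(5, 2)) // 3: two are queued, three are reset inline
}
```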

I'll look into this more. I'm not ready for a final fix (the referenced CL isn't quite right I think). If someone wants to submit a CL in the next couple of days that would be great. We are really close to the 1.13 freeze.


@wekb wekb commented Jul 15, 2019

I think you'd need to do both. This is a bit of a black-box issue that sporadically affects production services. Is there an ETA here, or would a submitted CL be processed fairly quickly? Thanks.

Contributor

@kardianos kardianos commented Jul 15, 2019

I think it is too late in the cycle to submit the CL, but if you want to, you could cherry-pick the change locally and try it out.


@wekb wekb commented Jul 15, 2019

Would 1.12.x be game, or does the 1.13 freeze affect 1.12.x updates?

Contributor

@kardianos kardianos commented Jul 15, 2019

@bcmills I'm unsure on release policy. I don't think 1.12.x would be game and I believe 1.13 is too deeply frozen for this change. Can you provide feedback? Ping me on Hangouts or here if you need detail on exact scope of change, CL linked.

@wekb Can you help validate the linked CL?

Member

@bcmills bcmills commented Jul 16, 2019

@kardianos if the change is too invasive or subtle for 1.13 at this point, then it's almost certainly not a candidate for a 1.12 backport either, especially given that there is a workaround (namely, setting the connection limit ≤ 50).


@wekb wekb commented Jul 16, 2019

This would be a problem for larger services where 50 concurrent connections isn't adequate, however.

The CL builds and tests out on my end under pretty good stress: 500 goroutines spamming local and remote MySQL instances. I patched 1.12.7 and the two versions of Go performed the same.

That being said, I'm not able to reproduce the reported issue outside of prod, and it would be difficult to deploy this in our production without a bit of a gamut of new Docker images, not to mention that we're trying to not have this issue affect us poorly again. :)


@wekb wekb commented Jul 17, 2019

I patched 1.12.7, and it builds and tests out on my end.

I put some fairly heavy load on both 1.12.7 and patched 1.12.7, basically the OP's 500-worker use case. I wasn't able to reproduce the issue locally, but both versions performed fine and comparably. The built-in Go tests passed as well, obviously.

It would be difficult to deploy this in our production to reproduce the issue without undoing workaround code and a gamut of new Docker images, not to mention that we're trying to not have this issue affect us poorly again. :)

All that said, the CL seems worthy, IMO.


@wekb wekb commented Sep 4, 2019

Any chance https://golang.org/cl/174122 can be pushed into 13.++ before it's forgotten? :-)

Member

@bcmills bcmills commented Sep 11, 2019

@wekb, last I checked CL 174122 was awaiting a bit of rework to avoid blocking indefinitely in ResetSession.

When the fix is ready, I suspect it will not be eligible for backporting: to my knowledge it is not a regression, and there is a workaround today (setting MaxOpenConns ≤ 50).


@wekb wekb commented Sep 12, 2019

Thanks. I just wanted to note that there's ongoing interest and to help make sure it makes it into a release in due time.

@gopherbot gopherbot closed this in 971f8a2 Mar 17, 2020
Member

@odeke-em odeke-em commented Mar 18, 2020

@odeke-em odeke-em reopened this Mar 18, 2020
@dmitshur dmitshur added this to the Go1.15 milestone Mar 18, 2020

@gopherbot gopherbot commented Jul 11, 2020

Change https://golang.org/cl/242102 mentions this issue: [release-branch.go1.14] database/sql: backport 5 Tx rollback related CLs


@gopherbot gopherbot commented Jul 15, 2020

Change https://golang.org/cl/242522 mentions this issue: [release-branch.go1.13] database/sql: backport 5 Tx rollback related CLs

gopherbot pushed a commit that referenced this issue Jul 16, 2020
Manually backported the subject CLs, because of lack of
Gerrit "forge-author" permissions, but also because the prior
cherry picks didn't apply cleanly, due to a tight relation chain.

The backport comprises:
* CL 174122
* CL 216197
* CL 223963
* CL 216240
* CL 216241

Note:
Due to the restriction that we cannot retroactively
introduce API changes to Go1.13.13 that weren't in Go1.13, the Conn.Validator
interface (from CL 174122, CL 223963) isn't exposed, and drivers will just be
inspected to see whether they implement an IsValid() bool method.
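
Inspecting a value for a method without a named interface is usually done with a type assertion against an anonymous interface. The sketch below shows that pattern only; it is not the backported code, and connIsValid, plainConn, and checkedConn are hypothetical names.

```go
package main

import "fmt"

// connIsValid reports whether c may be reused. A value that has no
// IsValid() bool method is treated as valid.
func connIsValid(c interface{}) bool {
	if v, ok := c.(interface{ IsValid() bool }); ok {
		return v.IsValid()
	}
	return true
}

// plainConn is a hypothetical driver connection without IsValid.
type plainConn struct{}

// checkedConn is a hypothetical driver connection that implements IsValid.
type checkedConn struct{ healthy bool }

func (c checkedConn) IsValid() bool { return c.healthy }

func main() {
	fmt.Println(connIsValid(plainConn{}))                 // true
	fmt.Println(connIsValid(checkedConn{healthy: false})) // false
	fmt.Println(connIsValid(checkedConn{healthy: true}))  // true
}
```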

For a description of the content of each CL:

* CL 174122:
database/sql: process all Session Resets synchronously

Adds a new interface, driver.ConnectionValidator, to allow
drivers to signal they should not be used again,
separately from the session resetter interface.
This is done now that the session reset is done
after the connection is put into the connection pool.

Previous behavior attempted to run Session Resets
in a background worker. This implementation had two
problems: untested performance gains for additional
complexity, and failures when the pool size
exceeded the connection reset channel buffer size.

* CL 216197:
database/sql: check conn expiry when returning to pool, not when handing it out

With the original connection reuse strategy, it was possible that
when a new connection was requested, the pool would wait for an
existing connection to return for re-use in a full connection
pool, and then it would check if the returned connection was expired.
If the returned connection expired while awaiting re-use, it would
return an error to the location requesting the new connection.
The existing call sites requesting a new connection were often the last
attempt at returning a connection for a query. This would then
result in a failed query.

This change ensures that we perform the expiry check right
before a connection is inserted back into the connection pool,
rather than while requesting a new connection. A request for a new
connection will no longer fail due to the connection expiring.

* CL 216240:
database/sql: prevent Tx statement from committing after rollback

It was possible for a Tx that was aborted for rollback
asynchronously to execute a query after the rollback had completed
on the database, which often would auto commit the query outside
of the transaction.

W-locking tx.closemu prior to issuing the rollback
ensures that any Tx query either fails or finishes
on the Tx, and never runs after the Tx has rolled back.
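
The locking pattern described here can be sketched with a sync.RWMutex: queries hold the read lock, and rollback takes the write lock before marking the transaction done. Only the field name closemu comes from the commit message; the tx type, errTxDone, and the methods below are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTxDone = errors.New("tx: transaction has already been rolled back")

// tx is a hypothetical, simplified transaction.
type tx struct {
	closemu sync.RWMutex
	done    bool
}

// exec holds the read lock for the duration of the query, so a rollback
// cannot complete while the query is in flight.
func (t *tx) exec(query string) error {
	t.closemu.RLock()
	defer t.closemu.RUnlock()
	if t.done {
		return errTxDone
	}
	// A real implementation would send the query on the Tx connection here.
	return nil
}

// rollback takes the write lock, which waits for in-flight queries to
// finish and blocks new ones until the transaction is marked done.
func (t *tx) rollback() {
	t.closemu.Lock()
	defer t.closemu.Unlock()
	t.done = true
}

func main() {
	t := &tx{}
	fmt.Println(t.exec("INSERT ...")) // <nil>
	t.rollback()
	fmt.Println(t.exec("INSERT ...")) // tx: transaction has already been rolled back
}
```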

* CL 216241:
database/sql: on Tx rollback, retain connection if driver can reset session

Previously the Tx would drop the connection after rolling back from
a context cancel. Now if the driver can reset the session,
keep the connection.

* CL 223963:
database/sql: add test for Conn.Validator interface

This addresses comments made by Russ after
https://golang.org/cl/174122 was merged. It adds a test
for the connection validator and renames the interface to just
"Validator".

Updates #31480
Updates #32530
Updates #32942
Updates #34775
Fixes #40205

Change-Id: I6d7307180b0db0bf159130d91161764cf0f18b58
Reviewed-on: https://go-review.googlesource.com/c/go/+/242522
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Theophanes <kardianos@gmail.com>
gopherbot pushed a commit that referenced this issue Jul 16, 2020
Manually backported the subject CLs, because of lack of
Gerrit "forge-author" permissions, but also because the prior
cherry picks didn't apply cleanly, due to a tight relation chain.

The backport comprises:
* CL 174122
* CL 216197
* CL 223963
* CL 216240
* CL 216241

Note:
Due to the restriction that we cannot retroactively
introduce API changes to Go1.14.6 that weren't in Go1.14, the Conn.Validator
interface (from CL 174122, CL 223963) isn't exposed, and drivers will just be
inspected to see whether they implement an IsValid() bool method.

For a description of the content of each CL:

* CL 174122:
database/sql: process all Session Resets synchronously

Adds a new interface, driver.ConnectionValidator, to allow
drivers to signal they should not be used again,
separately from the session resetter interface.
This is done now that the session reset is done
after the connection is put into the connection pool.

Previous behavior attempted to run Session Resets
in a background worker. This implementation had two
problems: untested performance gains for additional
complexity, and failures when the pool size
exceeded the connection reset channel buffer size.

* CL 216197:
database/sql: check conn expiry when returning to pool, not when handing it out

With the original connection reuse strategy, it was possible that
when a new connection was requested, the pool would wait for an
existing connection to return for re-use in a full connection
pool, and then it would check if the returned connection was expired.
If the returned connection expired while awaiting re-use, it would
return an error to the location requesting the new connection.
The existing call sites requesting a new connection were often the last
attempt at returning a connection for a query. This would then
result in a failed query.

This change ensures that we perform the expiry check right
before a connection is inserted back into the connection pool,
rather than while requesting a new connection. A request for a new
connection will no longer fail due to the connection expiring.

* CL 216240:
database/sql: prevent Tx statement from committing after rollback

It was possible for a Tx that was aborted for rollback
asynchronously to execute a query after the rollback had completed
on the database, which often would auto commit the query outside
of the transaction.

W-locking tx.closemu prior to issuing the rollback
ensures that any Tx query either fails or finishes
on the Tx, and never runs after the Tx has rolled back.

* CL 216241:
database/sql: on Tx rollback, retain connection if driver can reset session

Previously the Tx would drop the connection after rolling back from
a context cancel. Now if the driver can reset the session,
keep the connection.

* CL 223963:
database/sql: add test for Conn.Validator interface

This addresses comments made by Russ after
https://golang.org/cl/174122 was merged. It adds a test
for the connection validator and renames the interface to just
"Validator".

Updates #31480
Updates #32530
Updates #32942
Updates #34775
Fixes #39101

Change-Id: I043d2d724a367588689fd7d6f3cecb39abeb042c
Reviewed-on: https://go-review.googlesource.com/c/go/+/242102
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Theophanes <kardianos@gmail.com>