Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database/sql: context cancellation allows statements to execute after rollback #34775

Open
dadkins-at-dropbox opened this issue Oct 8, 2019 · 11 comments

Comments

@dadkins-at-dropbox
Copy link

@dadkins-at-dropbox dadkins-at-dropbox commented Oct 8, 2019

What version of Go are you using (go version)?

1.12.8

Does this issue reproduce with the latest release?

I haven't been able to run with the latest version, but examination of the code shows that the problem is still there.

What did you do?

Very rarely, I have observed problems resulting from transactions rolled back as a result of context cancellation. Namely, I have observed writes from presumably aborted transactions persist nonetheless. I added enough debugging output to the mysql driver to observe the following sequence:

driver.Tx.Rollback()
driver.Stmt.ExecContext()
driver.Conn.Close()

What did you expect to see?

The driver documentation at https://golang.org/pkg/database/sql/driver/#ConnBeginTx:

    // If the context is canceled by the user the sql package will
    // call Tx.Rollback before discarding and closing the connection.

Furthermore, the documentation promises that driver connections won't be used concurrently by multiple goroutines:
Conn is a connection to a database. It is not used concurrently by multiple goroutines.

What did you see instead?

I believe that contract was violated and that there is a gap between the "rollback" and the "close" in which a statement was allowed to execute. Since MySQL defaults to autocommit, that statement which we thought was executing within a transaction (which got rolled back) actually executed outside the transaction and committed.

I believe the bug is in tx.rollback in sql.go:

// rollback aborts the transaction and optionally forces the pool to discard
// the connection.
func (tx *Tx) rollback(discardConn bool) error {
	if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
		return ErrTxDone
	}
	var err error
	withLock(tx.dc, func() {
		err = tx.txi.Rollback()
	})
	if err != driver.ErrBadConn {
		tx.closePrepared()
	}
	if discardConn {
		err = driver.ErrBadConn
	}
	tx.close(err)
	return err
}

The connection is locked to call driver.Tx.Rollback(), but nothing prevents a statement from executing before we get to tx.close(err).

@dadkins-at-dropbox

This comment has been minimized.

Copy link
Author

@dadkins-at-dropbox dadkins-at-dropbox commented Oct 8, 2019

Standalone repro:
https://gist.github.com/dadkins-at-dropbox/5b051b9ccb30f115c332c6037d03d63d

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"os"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <dbname>", os.Args[0])
	}
	dbName := os.Args[1]
	db, err := sql.Open("mysql", "root@/"+dbName)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.SetMaxOpenConns(100)
	db.SetMaxIdleConns(100)

	if _, err := db.Exec("drop table if exists t"); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec("create table t (id int not null auto_increment primary key)"); err != nil {
		log.Fatal(err)
	}

	ins, err := db.Prepare("insert into t (id) values (null)")
	if err != nil {
		log.Fatal(err)
	}

	sel, err := db.Prepare("select count(*) from t")
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 10; i++ {
		go func() {
			for i := 0; ; i++ {
				runTx(db, ins)
			}
		}()
	}
	for range time.Tick(time.Second) {
		var n int
		r := sel.QueryRow()
		if err := r.Scan(&n); err != nil {
			fmt.Println("select", err)
			continue
		}
		fmt.Printf("%d rows (should be 0)\n", n)
	}
}

func runTx(db *sql.DB, stmt *sql.Stmt) error {
	ctx, cancel := context.WithCancel(context.Background())
	quit := make(chan struct{})
	done := make(chan struct{})
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		fmt.Println("begin", err)
		return err
	}
	defer tx.Rollback()

	stmt = tx.Stmt(stmt)

	go func() {
		timer := time.NewTimer(time.Microsecond)
		defer timer.Stop()
		select {
		case <-quit:
		case <-timer.C:
		}
		cancel()
		close(done)
	}()
	defer func() {
		close(quit)
		<-done
	}()

	for i := 0; i < 10; i++ {
		if _, err := stmt.Exec(); err != nil {
			return err
		}
	}
	_ = tx.Rollback()
	return nil
}

outputs
1 rows (should be 0)
3 rows (should be 0)
4 rows (should be 0)
5 rows (should be 0)
^Csignal: interrupt

when you should never succeed in inserting any rows, since the transaction is always rolled back.

$ go version
go version go1.13.1 darwin/amd64

@methane

This comment was marked as resolved.

Copy link
Contributor

@methane methane commented Oct 8, 2019

You used stmt.Exec(). It doesn't use transaction tx.
You should use tx.Exec() instead.

@dadkins-at-dropbox

This comment has been minimized.

Copy link
Author

@dadkins-at-dropbox dadkins-at-dropbox commented Oct 8, 2019

stmt = tx.Stmt(stmt) associates the prepared statement with the transaction, and that's the one the repro uses.

@methane

This comment has been minimized.

Copy link
Contributor

@methane methane commented Oct 9, 2019

I'm sorry, I've missed it. I confirm your repro.

@andybons

This comment has been minimized.

Copy link
Member

@andybons andybons commented Oct 10, 2019

@andybons andybons added this to the Unplanned milestone Oct 10, 2019
@dadkins-at-dropbox

This comment has been minimized.

Copy link
Author

@dadkins-at-dropbox dadkins-at-dropbox commented Oct 11, 2019

Here's another log trace, via the mysql driver, of driver.Tx.Rollback being called in the middle of a prepared driver.Stmt.ExecContext and wreaking havoc, because that driver is not expecting those calls to be interleaved:

[mysql] 2019/10/10 23:25:59 connection.go:577: [0xc000348fc0]stmt.ExecContext

*** COM_STMT_EXECUTE
[mysql] 2019/10/10 23:25:59 packets.go:155: [0xc000348fc0] :write len=10 seq=0
[mysql] 2019/10/10 23:25:59 packets.go:156: [0xc000348fc0]
00000000  17 01 00 00 00 00 01 00  00 00                    |..........|

*** column count
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=1 seq=1
[mysql] 2019/10/10 23:25:59 packets.go:80: [0xc000348fc0]
00000000  01                                                |.|

*** column def
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=37 seq=2
[mysql] 2019/10/10 23:25:59 packets.go:80: [0xc000348fc0]
00000000  03 64 65 66 00 00 00 0f  43 4f 4e 4e 45 43 54 49  |.def....CONNECTI|
00000010  4f 4e 5f 49 44 28 29 00  0c 3f 00 15 00 00 00 08  |ON_ID()..?......|
00000020  a1 00 00 00 00                                    |.....|

*** EOF
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=5 seq=3
[mysql] 2019/10/10 23:25:59 packets.go:80: [0xc000348fc0]
00000000  fe 00 00 03 00                                    |.....|

*** RACE: stmt.ExecContext is still active, shouldn't be calling tx.Rollback ***
[mysql] 2019/10/10 23:25:59 transaction.go:30: [0xc000348fc0]tx.Rollback
[mysql] 2019/10/10 23:25:59 packets.go:454: busy buffer
*** resets mc.sequence = 0 ***

*** packet header for row...
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=10 seq=4
*** ErrPktSyncMul? since mc.sequence got reset ***

*** disaster, we've treating the packet body as a new packet header, and the sequence number matches! we're going to lunch now while we wait to read 10MB that's never coming.
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=10027008 seq=0
@dadkins-at-dropbox

This comment has been minimized.

Copy link
Author

@dadkins-at-dropbox dadkins-at-dropbox commented Nov 5, 2019

It would be great if someone more familiar with this code could take a look. I think I've pretty much nailed down the problem, but I'm not really in a good position to propose a fix to this code which doesn't introduce new problems or break some intricate assumptions.

I've confirmed the following race:

Goroutine 1:
(*Stmt).ExecContext begins a prepared statement in a transaction.
(*Tx).grabConn obtains tx.closemu in read mode and returns a valid (*driveConn)

Goroutine 2:
(*Tx).awaitDone awakes on canceled context and calls (*Tx).rollback
(*Tx).rollback atomically changes tx.done from 0 to 1
(*Tx).rollback calls tx.txi.Rollback() while holding (*driverConn) mutex.

Goroutine 1:
(*Stmt).ExecContext calls resultFromStatement with *driverConn and *driverStmt
resultFromStatement locks ds, which is actually dc.mutex.
resultFromStatement executes statement with driverStmt.ExecContext
resultFromStatement unlocks dc.mutex.

Goroutine 2:
(*Tx).rollback closes prepared statements and calls (*Tx).close
(*Tx).close calls tx.cancel(), which is a no-op.
(*Tx).close calls tx.releaseConn() while holding tx.closemu.

It definitely allows a statement to execute on a connection after a rollback caused by context cancellation, which under mysql autocommits that statement unexpectedly.

Examining the code, it seems like we should hold tx.closemu for the duration of tx.Close or tx.Rollback, certainly while we're changing the state of tx.done. I tried the following workaround:

diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 0f5bbc01c9..abc892dc08 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2021,8 +2021,8 @@ var ErrTxDone = errors.New("sql: transaction has already been committed or rolle
 func (tx *Tx) close(err error) {
        tx.cancel()
 
-       tx.closemu.Lock()
-       defer tx.closemu.Unlock()
+       //tx.closemu.Lock()
+       //defer tx.closemu.Unlock()
 
        tx.releaseConn(err)
        tx.dc = nil
@@ -2087,6 +2087,10 @@ func (tx *Tx) Commit() error {
                }
                return tx.ctx.Err()
        }
+
+       tx.closemu.Lock()
+       defer tx.closemu.Unlock()
+
        if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
                return ErrTxDone
        }
@@ -2104,6 +2108,12 @@ func (tx *Tx) Commit() error {
 // rollback aborts the transaction and optionally forces the pool to discard
 // the connection.
 func (tx *Tx) rollback(discardConn bool) error {
+       // Problem: We can't lock tx.closemu while an oustanding
+       // Rows holding tx.closemu in R mode.
+       // We must call tx.cancel to release that first.
+       tx.cancel()
+       tx.closemu.Lock()
+       defer tx.closemu.Unlock()
        if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
                return ErrTxDone
        }

It was unsuccessful.

First, I broke Issues 18719 #18719. The fix for that issue (https://go-review.googlesource.com/c/go/+/35550/) introduced the closemu mutex, which was supposed to be held exclusively while closing a tx and releasing its conn back to the pool, while queries hold closemu in read mode to prevent the close from occurring mid-query. Its test inserted a hook into tx.grabConn which expects to be able to wait for tx.done to change from 0 to 1 while it sits there holding tx.closemu in read mode. That no longer works, but the test could be changed.

The fix to Issue 18719 caused another issue: #20575. Namely, *Rows objects hold closemu in read mode, which prevents us from rolling back a transaction without first calling (*Rows).Close. The fix for that issue (https://go-review.googlesource.com/c/go/+/44812/) sticks a call to tx.cancel() in tx.close() prior to obtaining tx.closemu, which the (*Rows) code catches in order to unlock tx.closemu. My attempted fix broke this, because I moved the call to tx.closemu.Lock() earlier.

I can somewhat workaround this problem by inserting an extra call to tx.cancel() in tx.Rollback, but that's hacky and introduces a race between tx.rollback and tx.awaitDone also calling tx.rollback.

I think I'm close to a reasonable solution, but I don't quite know how to resolve the problems with tx.closemu. We need the exclusion between the various queries we can run on a transaction, and commit/rollback. But, perhaps a mutex is not the right solution to hold across multiple calls while iterating over a resultset. Or perhaps context cancellation is not the right mechanism to reclaim a connection while iterating over a rowset.

@dadkins-at-dropbox

This comment has been minimized.

Copy link
Author

@dadkins-at-dropbox dadkins-at-dropbox commented Nov 6, 2019

I now have a plausible solution, and a test case that it fixes (in addition to the repro program already posted). I'm awaiting approval from my employer to submit the change for review, but here it is in the meantime if anyone's interested. My company only uses even numbered versions of Go, so I'd like to get this fix into 1.14 if possible:

diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 0f5bbc01c9..a0fd05423c 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2018,12 +2018,10 @@ var ErrTxDone = errors.New("sql: transaction has already been committed or rolle
 
 // close returns the connection to the pool and
 // must only be called by Tx.rollback or Tx.Commit.
+//
+// The transaction context must already have been canceled,
+// and closemu must locked.
 func (tx *Tx) close(err error) {
-       tx.cancel()
-
-       tx.closemu.Lock()
-       defer tx.closemu.Unlock()
-
        tx.releaseConn(err)
        tx.dc = nil
        tx.txi = nil
@@ -2090,6 +2088,18 @@ func (tx *Tx) Commit() error {
        if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
                return ErrTxDone
        }
+
+       // cancel is called after done transitions from 0 to 1, but
+       // before we attempt to grab closemu, since context cancellation
+       // will cause any operations holding closemu in read mode to
+       // release their locks.
+       tx.cancel()
+
+       // closemu must be acquired exclusively before rolling back to
+       // allow any oustanding operations to finish.
+       tx.closemu.Lock()
+       defer tx.closemu.Unlock()
+
        var err error
        withLock(tx.dc, func() {
                err = tx.txi.Commit()
@@ -2107,6 +2117,18 @@ func (tx *Tx) rollback(discardConn bool) error {
        if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
                return ErrTxDone
        }
+
+       // cancel is called after done transitions from 0 to 1, but
+       // before we attempt to grab closemu, since context cancellation
+       // will cause any operations holding closemu in read mode to
+       // release their locks.
+       tx.cancel()
+
+       // closemu must be acquired exclusively before rolling back to
+       // allow any oustanding operations to finish.
+       tx.closemu.Lock()
+       defer tx.closemu.Unlock()
+
        var err error
        withLock(tx.dc, func() {
                err = tx.txi.Rollback()
@methane

This comment has been minimized.

Copy link
Contributor

@methane methane commented Nov 8, 2019

Your patch looks good to me. Would you create a pull request?

@dadkins-at-dropbox

This comment has been minimized.

Copy link
Author

@dadkins-at-dropbox dadkins-at-dropbox commented Nov 8, 2019

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jan 24, 2020

Change https://golang.org/cl/216240 mentions this issue: database/sql: prevent Tx statement from committing after rollback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.